I have a list of data frames:
all_df = ['df_0','df_1','df_2','df_3','df_4','df_5','df_6']
How can I call them from this list to do something like this:
for (df,names) in zip(all_df,names):
    df.to_csv('output/{}.csv'.format(names))
When executing this I expectedly get the error 'str' object has no attribute 'to_csv', since I'm passing a string to to_csv.
How can I save several dataframes (or perform other actions on them) in a for loop?
Thanks!
Could you please also give me an idea of how to create the 'right' list of data frames from this:
path = 'inp/multysheet_excel_01.xlsx'
xl = pd.ExcelFile(path)
sh_name = xl.sheet_names
n = 0
for i in sh_name:
    exec('df_{} = pd.read_excel("{}", sheet_name="{}")'.format(n, path, i))
    n += 1
So basically I'm trying to get each sheet of an input excel as a separate dataframe, perform some actions on them, and save each output dataframe in a separate excel file.
You're quite close, but I see some mistakes in that for loop. Say you have a list of dataframes dfs and their corresponding names in a list of strings names; you can then save those dataframes under those names like so:
dfs = [df_0, df_1, df_2]
names = ['df_0','df_1','df_2']
for df, name in zip(dfs, names):
    df.to_csv('output\\{}.csv'.format(name))
Though if you only had a list of names, you could also do something like:
names = ['df_0','df_1','df_2']
for name in names:
    globals()[name].to_csv('output\\{}.csv'.format(name))
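As for building the dataframes from your multi-sheet Excel file, you can skip exec entirely: passing sheet_name=None to pd.read_excel returns a dict mapping each sheet name to its dataframe, which you can loop over directly. A minimal sketch reusing your path (the output filenames here are just illustrative):

import pandas as pd

path = 'inp/multysheet_excel_01.xlsx'
# sheet_name=None loads every sheet into a dict: {sheet name: DataFrame}
dfs = pd.read_excel(path, sheet_name=None)

for sheet_name, df in dfs.items():
    # perform your actions on df here, then save each result separately
    df.to_csv('output\\{}.csv'.format(sheet_name))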
Goal: Append DataFrames in a loop to get one combined dataframe.
df_base = pd.DataFrame(columns=df_col.columns)
file_path = 'DATA/'
filenames = ['savedrecs-23.txt', 'savedrecs-25.txt', 'savedrecs-24.txt']
For-Loop:
for file in filenames:
    path = file_path + file
    doc = codecs.open(path, 'rU', 'UTF-8')
    df_add = pd.read_csv(doc, sep='\t')
    res = df_base.append(df_add)
res.shape
Expected Outcome:
(15, 67); all three dataframes merged into one dataframe
Current Outcome:
(5, 67); just returns the last dataframe in the loop.
res = df_base.append(df_add)
The pandas append function does not modify the object it is called on. It returns a new object containing the rows of the added dataframe appended onto the rows of the original dataframe.
Since you never reassigned df_base, your output is just the frame from the last file, appended to the empty df_base dataframe.
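The minimal fix is to assign the result back inside the loop, so each file's rows accumulate:

df_base = df_base.append(df_add)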
Note that the pandas documentation doesn't recommend iteratively appending dataframes together. Instead, "a better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once." (with an example given)
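Applied to your loop, that recommendation looks something like this (a minimal sketch; note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is the way to go on current versions):

import codecs
import pandas as pd

file_path = 'DATA/'
filenames = ['savedrecs-23.txt', 'savedrecs-25.txt', 'savedrecs-24.txt']

frames = []
for file in filenames:
    doc = codecs.open(file_path + file, 'rU', 'UTF-8')
    frames.append(pd.read_csv(doc, sep='\t'))

# concatenate everything at once instead of appending inside the loop
res = pd.concat(frames, ignore_index=True)
res.shape  # now (15, 67): all three files combined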
I have the files AAMC_K.txt, AAU.txt, ACU.txt, and ACY.txt in a folder called AMEX. I am trying to merge these text files into one dataframe. I tried pd.merge(), but I get an error that the merge function needs right and left parameters, and my data is in a Python list. How can I merge the data in data_list into one pandas dataframe?
import pandas as pd
import os
textfile_names = os.listdir("AMEX")
textfile_names.sort()
data_list = []
for i in range(len(textfile_names)):
    data = pd.read_csv("AMEX/" + textfile_names[i], index_col=None, header=0)
    data_list.append(data)
frame = pd.merge(data_list, on='<DTYYYYMMDD>', how='outer')
AE.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
AE,D,19970102,000000,12.6250,12.6250,11.7500,11.7500,144,0
AE,D,19970103,000000,11.8750,12.1250,11.8750,12.1250,25,0
AAU.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
AAU,D,20020513,000000,0.4220,0.4220,0.4220,0.4220,0,0
AAU,D,20020514,000000,0.4177,0.4177,0.4177,0.4177,0,0
ACU.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
ACU,D,19970102,000000,5.2500,5.3750,5.1250,5.1250,52,0
ACU,D,19970103,000000,5.1250,5.2500,5.0625,5.2500,12,0
ACY.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
ACY,D,19980116,000000,9.7500,9.7500,8.8125,8.8125,289,0
ACY,D,19980120,000000,8.7500,8.7500,8.1250,8.1250,151,0
I want the output to be joined on <DTYYYYMMDD> and put into one dataframe, frame.
OUTPUT
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>,<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
ACU,D,19970102,000000,5.2500,5.3750,5.1250,5.1250,52,0,AE,D,19970102,000000,12.6250,12.6250,11.7500,11.7500,144,0
ACU,D,19970103,000000,5.1250,5.2500,5.0625,5.2500,12,0,AE,D,19970103,000000,11.8750,12.1250,11.8750,12.1250,25,0
As @busybear says, pd.concat is the right tool for this job: frame = pd.concat(data_list).
merge is for when you're joining two dataframes which usually have some of the same columns and some different ones. You choose a column (or index or multiple) which identifies which rows in the two dataframes correspond to each other, and pandas handles making a dataframe whose rows are combinations of the corresponding rows in the two original dataframes. This function only works on 2 dataframes at a time; you'd have to do a loop to merge more in (it's uncommon to need to merge many dataframes this way).
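If you really wanted merge semantics across all of these files, one way is to fold the list with functools.reduce, merging two frames at a time. A sketch, assuming each frame in data_list has the <DTYYYYMMDD> column; since the files share all their header names, you would also need to manage the suffixes pandas adds to overlapping columns (e.g. via the suffixes argument, or by renaming each file's columns first):

from functools import reduce
import pandas as pd

# outer-merge every frame in the list on the shared date column, two at a time
frame = reduce(
    lambda left, right: pd.merge(left, right, on='<DTYYYYMMDD>', how='outer'),
    data_list
)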
concat is for when you have multiple dataframes and want to just append all of their rows or columns into one large dataframe. (Let's assume you're concatenating rows, as you want here.) It doesn't use an identifier to determine which rows correspond. All it does is create a new dataframe which has each row from each of the concated dataframes (all the rows from the first, then all from the second, etc.).
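For your data that boils down to a one-liner on the data_list you already built; ignore_index renumbers the rows so the indices from the individual files don't collide:

frame = pd.concat(data_list, ignore_index=True)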
I think the above is a decent TLDR on merge vs concat but see here for a lengthy but much more comprehensive guide on using merge/join/concat with dataframes.
I have a list of json files in Databricks, and what I am trying to do is to read each json, extract the values needed, and then append them as a row to an empty pandas dataframe. Each json file corresponds to one row of the final dataframe. The initial json file list is 50k entries long. What I have built so far is the function below, which does the job, but it is so slow that I have to subset the file list into 5k bins and run each one separately, at 30 minutes each. I am limited to a 3-node cluster in Databricks.
Any chance that you could improve the efficiency of my function? Thanks in advance.
### Create a big dataframe including all json files ###
def jsons_to_pdf(all_paths):
    # Create an empty pandas dataframe (it is defined only with column names)
    pdf = create_initial_pdf(samplefile)

    # Append each row onto the above dataframe
    for path in all_paths:
        # Create a spark dataframe
        sdf = sqlContext.read.json(path)

        # Create two lists of extracted values
        init_values = sdf.select("id","logTimestamp","otherTimestamp").rdd.flatMap(lambda x: x).collect()
        id_values = sdf.select(sdf["dataPoints"]["value"]).rdd.flatMap(lambda x: x).collect()[0]

        # Append the concatenated lists as one row of the initial dataframe
        pdf.loc[len(pdf)] = init_values + id_values

    return pdf
One json file looks like the following:
And what I want to achieve is to have dataPoints['id'] as new columns and dataPoints['value'] as their values, so as to end up with this:
According to your example, what you want to perform is a pivot and then transform your data into a pandas dataframe.
The steps are:
collect all your jsons into one big dataframe,
pivot your data,
transform them into a pandas dataframe.
Try something like this:
from functools import reduce
from pyspark.sql import functions as F

def jsons_to_pdf(all_paths):
    # Create a big dataframe from all the jsons
    sdf = reduce(
        lambda a, b: a.union(b),
        [sqlContext.read.json(path) for path in all_paths]
    )

    # Select and pivot your data
    pivot_df = sdf.select(
        "imoNo",
        "logTimestamp",
        "payloadTimestamp",
        F.explode("datapoints").alias("datapoint")
    ).groupBy(
        "imoNo",
        "logTimestamp",
        "payloadTimestamp",
    ).pivot(
        "datapoint.id"
    ).sum("datapoint.value")

    # Convert to a pandas dataframe
    pdf = pivot_df.toPandas()

    return pdf
According to your comment, you can replace the list of files all_paths with a generic path and change the way you create sdf:
all_paths = 'abc/*/*/*'  # 3 wildcards: one for year, one for month, one for day

def jsons_to_pdf(all_paths):
    # Create a big dataframe from all the jsons in a single distributed read
    sdf = sqlContext.read.json(all_paths)
This should significantly improve performance, since Spark performs one distributed read over all the files instead of reading them one by one in a Python loop.