I have multiple CSV files of time series data, one file per day of the month. Every file has one datetime column followed by multiple other columns. I want to merge (not pd.concat) all the files into one dataframe. How can I loop over the files in the folder and merge them?
Basically, what I am trying to do is: read the first two CSV files and merge them into a dataframe ('df_merged'), then pick the third CSV file and merge it with that dataframe, then pick the fourth CSV file and merge it in, and so on. Could you please tell me how I can do this with a loop?
Note: the first merge should not be an outer merge.
What I have so far, without the for loop, is code for just four files:
import pandas as pd

## Reading the csv files
df_file_one = pd.read_csv("C:/Users/Desktop/2022/prices_Jan_2022-01-01.csv")
df_file_two = pd.read_csv("C:/Users/Desktop/2022/prices_Jan_2022-01-02.csv")
df_file_three = pd.read_csv("C:/Users/Desktop/2022/prices_Jan_2022-01-03.csv")
df_file_four = pd.read_csv("C:/Users/Desktop/2022/prices_Jan_2022-01-04.csv")
# Merging the files
df_merged = pd.merge(df_file_one, df_file_two, left_index=True, right_index=True)  # The first merge should not be an outer merge.
df_merged = pd.merge(df_merged, df_file_three, left_index=True, right_index=True, how='outer')
df_merged = pd.merge(df_merged, df_file_four, left_index=True, right_index=True, how='outer')
If the order of files doesn't matter then we can loop through the files like this:
from glob import glob
import pandas as pd
files = glob("C:/Users/Desktop/2022/prices_*.csv")
df = pd.merge(
    pd.read_csv(files.pop()),
    pd.read_csv(files.pop()),
    left_index=True,
    right_index=True,
)
while files:
    df = pd.merge(
        df,
        pd.read_csv(files.pop()),
        left_index=True,
        right_index=True,
        how='outer',
    )
If the order is meaningful, but all you have is the files for a single month named like prices_Jan_2022-01-02.csv, then we can rely on the natural lexicographic order:
files = sorted(glob(...), reverse=True)
If more than one month is present, but the name pattern is still the same, then we can use the yyyy-mm-dd pattern at the end of the names as a key to sort the files:
files = sorted(glob(...), key=lambda f: f[-14:-4], reverse=True)
All the other code stays the same.
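Putting the sorting and the loop together, a consolidated sketch (assuming the same hypothetical folder, the prices_*.csv naming, and at least two files):
from glob import glob

import pandas as pd

# Sort by the trailing yyyy-mm-dd so the earliest days are merged first.
files = sorted(glob("C:/Users/Desktop/2022/prices_*.csv"), key=lambda f: f[-14:-4])

first, second, *rest = files
df = pd.merge(pd.read_csv(first), pd.read_csv(second),
              left_index=True, right_index=True)  # first merge: not outer
for path in rest:
    df = pd.merge(df, pd.read_csv(path),
                  left_index=True, right_index=True, how='outer')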
For professional purposes I need to produce reports that include new entries every week.
I have 16 dataframes with the same column names (the dataframes are named week1, week2, ... week16).
I created a list of the dataframes and then a loop. I wanted to rename the column with index 1, and I did not succeed:
lists = [week1, week2, week3, week4, week5, week6, week7, week8, week9, week10,
         week11, week12, week13, week14, week15, week16]

for i in lists:
    i.rename(columns={'1': 'State'}, inplace=True)
I am forced to change every column name manually because I can't set up the loop, and this is only one of the columns.
How can I make sure I can call all the dataframes in a loop?
I have also tried the suggestions in this thread, but somehow the append method did not work for me; the column names aren't changed after I run the script.
Thanks for the help!
Sample dataframes:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randn(10,10),columns=range(10))
df2 = pd.DataFrame(np.random.randn(10,10),columns=range(10))
df3 = pd.DataFrame(np.random.randn(10,10),columns=range(10))
df_list = [df1, df2, df3]
Try a list comprehension with the rename() method:
df_list = [x.rename(columns={1: 'State'}) for x in df_list]
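A likely reason the original loop seemed to do nothing: the columns here are integers, so the string key '1' matches no column, and rename silently ignores missing keys. With an integer key the in-place loop works as well:
for df in df_list:
    df.rename(columns={1: 'State'}, inplace=True)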
Summary of Problem
Short Version
How do I go from a Dask Bag of Pandas DataFrames, to a single Dask DataFrame?
Long Version
I have a number of files that are not readable by any of dask.dataframe's various read functions (e.g. dd.read_csv or dd.read_parquet). I do have my own function that will read them in as Pandas DataFrames (function only works on one file at a time, akin to pd.read_csv). I would like to have all of these single Pandas DataFrames in one large Dask DataFrame.
Minimum Working Example
Here's some example CSV data (my data isn't actually in CSVs, but I'm using this format for ease of example). To create a minimum working example, save the rows below as a CSV, make a few copies, and use the code that follows.
"gender","race/ethnicity","parental level of education","lunch","test preparation course","math score","reading score","writing score"
"female","group B","bachelor's degree","standard","none","72","72","74"
"female","group C","some college","standard","completed","69","90","88"
"female","group B","master's degree","standard","none","90","95","93"
"male","group A","associate's degree","free/reduced","none","47","57","44"
"male","group C","some college","standard","none","76","78","75"
from glob import glob
import pandas as pd
import dask.bag as db
files = glob('/path/to/your/csvs/*.csv')
bag = db.from_sequence(files).map(pd.read_csv)
What I've tried so far
import pandas as pd
import dask.bag as db
import dask.dataframe as dd
# Create a Dask bag of pandas dataframes
bag = db.from_sequence(list_of_files).map(my_reader_function)
df = bag.map(lambda x: x.to_records()).to_dataframe()  # this doesn't work
df = bag.map(lambda x: x.to_dict(orient=<any option>)).to_dataframe()  # neither does this
# This gets me really close: it's a bag of Dask DataFrames,
# but I can't figure out how to concatenate them together.
df = bag.map(dd.from_pandas, npartitions=1)
df = dd.from_delayed(bag)  # returns an error
I recommend using dask.delayed with dask.dataframe. There is a good example of what you want to do here:
https://docs.dask.org/en/latest/delayed-collections.html
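Roughly, the delayed route from that page looks like this; a sketch reusing the question's my_reader_function and list_of_files:
import dask
import dask.dataframe as dd

# Wrap each file read in a delayed task, then build one Dask DataFrame.
delayed_dfs = [dask.delayed(my_reader_function)(f) for f in list_of_files]
df = dd.from_delayed(delayed_dfs)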
Here are two additional possible solutions:
1. Convert the bag to a list of dataframes, then use dd.multi.concat:
import dask.dataframe as dd

bag  # a dask bag of pandas dataframes
list_of_dfs = bag.compute()
df = dd.multi.concat(list_of_dfs).compute()
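(In current dask versions, dd.concat is the public name for the same function, so dd.concat(list_of_dfs) also works.)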
2. Convert to a bag of dictionaries and use bag.to_dataframe:
bag_of_dicts = bag.map(lambda df: df.to_dict(orient='records')).flatten()
df = bag_of_dicts.to_dataframe().compute()
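Note that flattening to dictionaries discards the original dtypes; to_dataframe infers them from a sample of the data, so you can pass an explicit meta if the inference comes out wrong.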
In my own specific use case, option #2 had better performance than option #1.
If you already have a bag of dataframes, then you can do the following:
Convert the bag to delayed partitions,
convert the delayed partitions to delayeds of dataframes by concatenating,
and create a dataframe from these delayeds.
In Python code:
import dask
import dask.dataframe
import pandas

def bag_to_dataframe(bag, **concat_kwargs):
    # Each delayed partition is a list of pandas DataFrames.
    partitions = bag.to_delayed()
    # Lazily concatenate every partition into a single DataFrame.
    dataframes = [
        dask.delayed(pandas.concat)(partition, **concat_kwargs)
        for partition in partitions
    ]
    return dask.dataframe.from_delayed(dataframes)
You might want to control the concatenation of partitions, for example to ignore the index.
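For example, a sketch that discards the per-file indices while concatenating:
df = bag_to_dataframe(bag, ignore_index=True)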
I'm trying to read many txt files into my dataframe, and the code below works. However, it duplicates some of my columns (not all of them), and I couldn't find a solution. What can I do to prevent this?
import functools
import glob

import pandas as pd

dfs = pd.concat(
    map(functools.partial(pd.read_csv, sep='\t', low_memory=False),
        glob.glob(r'/folder/*.txt')),
    sort=False,
)
Let's say my data should look like this:
[screenshot: expected output]
But it looks like this:
[screenshot: output with duplicated columns]
I don't want my columns to be duplicated.
Could you give us a bit more information? The output of dfs.columns would be especially useful. I suspect there could be extra spaces in your column names, which would cause pandas to treat them as different columns.
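If stray whitespace does turn out to be the cause, normalizing the names before concatenating should fix it. A sketch, using a hypothetical read_clean helper:
import glob

import pandas as pd

def read_clean(path):
    # Read one file and strip stray whitespace from the column names.
    df = pd.read_csv(path, sep='\t', low_memory=False)
    df.columns = df.columns.str.strip()
    return df

dfs = pd.concat(map(read_clean, glob.glob(r'/folder/*.txt')), sort=False)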
Also, you could try dask for this:
import dask.dataframe as dd

dfs = dd.read_csv(r'/folder/*.txt', sep='\t').compute()
This is a bit simpler and should give the same result.
It is important to think of the concat process as having two possible outcomes: by choosing the axis, you can either add new columns, as in example (I) below, or add new rows, as in example (II). pd.concat lets you do this by setting the axis to either 0 (rows) or 1 (columns).
Read more in the excellent documentation: concat
Example I:
import pandas as pd
import glob
pd.concat([pd.read_csv(f) for f in glob.glob(r'/folder/*.txt')], axis=1)
Example II:
pd.concat([pd.read_csv(f) for f in glob.glob(r'/folder/*.txt')], axis=0)
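If the files each start their index at 0, stacking rows this way leaves duplicate index labels; passing ignore_index=True to pd.concat yields a fresh 0..n-1 index.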
I have the following files AAMC_K.txt, AAU.txt, ACU.txt, ACY.txt in a folder called AMEX. I am trying to merge these text files into one dataframe. I have tried pd.merge(), but I get an error that the merge function needs left and right parameters, and my data is in a Python list. How can I merge the data in data_list into one pandas dataframe?
import pandas as pd
import os
textfile_names = os.listdir("AMEX")
textfile_names.sort()
data_list = []
for name in textfile_names:
    data = pd.read_csv("AMEX/" + name, index_col=None, header=0)
    data_list.append(data)

frame = pd.merge(data_list, on='<DTYYYYMMDD>', how='outer')
"AE.txt"
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
AE,D,19970102,000000,12.6250,12.6250,11.7500,11.7500,144,0
AE,D,19970103,000000,11.8750,12.1250,11.8750,12.1250,25,0
AAU.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
AAU,D,20020513,000000,0.4220,0.4220,0.4220,0.4220,0,0
AAU,D,20020514,000000,0.4177,0.4177,0.4177,0.4177,0,0
ACU.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
ACU,D,19970102,000000,5.2500,5.3750,5.1250,5.1250,52,0
ACU,D,19970103,000000,5.1250,5.2500,5.0625,5.2500,12,0
ACY.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
ACY,D,19980116,000000,9.7500,9.7500,8.8125,8.8125,289,0
ACY,D,19980120,000000,8.7500,8.7500,8.1250,8.1250,151,0
I want the rows aligned on the <DTYYYYMMDD> column and put into one dataframe, frame.
OUTPUT
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>,<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
ACU,D,19970102,000000,5.2500,5.3750,5.1250,5.1250,52,0,AE,D,19970102,000000,12.6250,12.6250,11.7500,11.7500,144,0
ACU,D,19970103,000000,5.1250,5.2500,5.0625,5.2500,12,0,AE,D,19970103,000000,11.8750,12.1250,11.8750,12.1250,25,0
As @busybear says, pd.concat is the right tool for this job: frame = pd.concat(data_list).
merge is for when you're joining two dataframes which usually have some of the same columns and some different ones. You choose a column (or index or multiple) which identifies which rows in the two dataframes correspond to each other, and pandas handles making a dataframe whose rows are combinations of the corresponding rows in the two original dataframes. This function only works on 2 dataframes at a time; you'd have to do a loop to merge more in (it's uncommon to need to merge many dataframes this way).
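That said, if you really want the merged-on-date layout shown in the desired output, one way is to fold data_list with functools.reduce; a sketch (you may want to pass suffixes= so the repeated column names stay distinguishable):
from functools import reduce

frame = reduce(
    lambda left, right: pd.merge(left, right, on='<DTYYYYMMDD>', how='outer'),
    data_list,
)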
concat is for when you have multiple dataframes and want to append all of their rows or columns into one large dataframe. (Let's assume you're concatenating rows, as you want here.) It doesn't use an identifier to determine which rows correspond; it simply creates a new dataframe containing every row from each of the concatenated dataframes (all rows from the first, then all from the second, and so on).
I think the above is a decent TL;DR on merge vs concat, but see here for a lengthy and much more comprehensive guide on using merge/join/concat with dataframes.
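To make the difference concrete, a tiny sketch with hypothetical data:
import pandas as pd

a = pd.DataFrame({'date': ['2022-01-01', '2022-01-02'], 'x': [1, 2]})
b = pd.DataFrame({'date': ['2022-01-02', '2022-01-03'], 'y': [3, 4]})

merged = pd.merge(a, b, on='date', how='outer')  # rows aligned on 'date'
stacked = pd.concat([a, b])                      # rows simply appended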