I have a very unstructured folder where a lot of the files contain no entries: just the header row, with no data underneath. I know that I could include them and they would not change anything, but the headers are not the same everywhere, so every file means some extra manual work for me.
So far I know how to load all the files in a specific folder with the following code:
import glob
import pandas as pd

path = r'C:/Users/...'
all_files = glob.glob(path + "/*.csv")

li = []
for filename in all_files:
    frame = pd.read_csv(filename, index_col=None, header=0, sep=';', encoding='utf-8', low_memory=False)
    li.append(frame)

df = pd.concat(li, axis=0, ignore_index=True, sort=False)
How can I skip every file that has only one row (the header)?
Modify this loop from:
for filename in all_files:
    frame = pd.read_csv(filename, index_col=None, header=0, sep=';', encoding='utf-8', low_memory=False)
    li.append(frame)
To:
for filename in all_files:
    frame = pd.read_csv(filename, index_col=None, header=0, sep=';', encoding='utf-8', low_memory=False)
    if len(frame) > 0:  # a header-only file is read as an empty DataFrame, so skip it
        li.append(frame)
That's what if statements are for.
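For completeness, here is a minimal self-contained sketch of the whole loop with the skip in place (assuming the same folder layout, ';' separator and UTF-8 encoding as above):

import glob
import pandas as pd

path = r'C:/Users/...'   # same placeholder path as in the question
all_files = glob.glob(path + "/*.csv")

li = []
for filename in all_files:
    frame = pd.read_csv(filename, index_col=None, header=0, sep=';',
                        encoding='utf-8', low_memory=False)
    if len(frame) > 0:   # header-only files come back as empty DataFrames
        li.append(frame)

df = pd.concat(li, axis=0, ignore_index=True, sort=False)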
I have over 20 SAS (sas7bdat) files, all with the same columns, that I want to read in Python.
I need an iterative process to read all the files and rbind (row-bind) them into one big df.
This is what I have so far, but it throws an error saying no objects to concatenate.
import pandas as pd
import pyreadstat
import glob
import os

path = r'C:\\Users\myfolder'  # or unix / linux / mac path
all_files = glob.glob(os.path.join(path, "/*.sas7bdat"))

li = []
for filename in all_files:
    # cols is the list of column names to read (defined elsewhere)
    reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sas7bdat, filename, chunksize=10000, usecols=cols)
    for df, meta in reader:
        li.append(df)

frame = pd.concat(li, axis=0)
I found this answer about reading in CSV files helpful: Import multiple CSV files into pandas and concatenate into one DataFrame
So if one has SAS data files that are too big to read in one go and plans to append all of them into one df, then:
# reading in chunks (chunksize) keeps RAM usage under control
for filename in all_files:
    reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sas7bdat, filename, chunksize=10000, usecols=cols)
    for df, meta in reader:
        li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
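Note that the "no objects to concatenate" error most likely comes from the glob pattern: passing "/*.sas7bdat" with a leading slash to os.path.join makes the folder part of the path get dropped, so glob finds no files and li stays empty. Below is a minimal self-contained sketch with the pattern fixed (assuming pandas and pyreadstat are installed; the cols list is just a placeholder for your own column names):

import glob
import os

import pandas as pd
import pyreadstat

path = r'C:\Users\myfolder'
cols = ['col_a', 'col_b']   # placeholder: your list of column names

# no leading slash in the pattern, so the folder part of the path is kept
all_files = glob.glob(os.path.join(path, "*.sas7bdat"))

li = []
for filename in all_files:
    reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sas7bdat, filename,
                                            chunksize=10000, usecols=cols)
    for df, meta in reader:   # each chunk arrives as a (DataFrame, metadata) pair
        li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)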
I hope you can help me with this problem.
I am having issues with adding multiple CSV files in pandas.
I have 12 files of sales data that have the same columns (one for each month: Sales_January_2019, Sales_February_2019.... and so on until December).
I've tried the following code but it doesn't seem to work. Also, the index should be continuous and not reset after each file; I tried reset_index() but that didn't work either.
import pandas as pd
import glob
path = r'C:\Users\ricar\.spyder-py3\data' # my path
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=0, header=0)
    li.append(df)

df.reset_index(inplace=True)
frame = pd.concat(li, axis=0, ignore_index=True)
df.drop(columns=['x_t', 'perf'], inplace=True)
print(df)
Try correcting your code like this. The main problem in your version is that reset_index() and drop() are applied to df, which is just the last file read in the loop, instead of to the concatenated result:
import pandas as pd
import glob
path = r'C:\Users\ricar\.spyder-py3\data' # my path
files = glob.glob(path + "/*.csv")
# Make a list of dataframes
li = [pd.read_csv(file, index_col=0, header=0) for file in files]
# Concatenate dataframes and remove useless columns
df = pd.concat(li, axis=0, ignore_index=True)
df.drop(columns=["x_t", "perf"], inplace=True)
print(df)
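If you also want to record which monthly file each row came from, one option (a sketch that is not part of the original answer; the source_file column name is just illustrative) is to add the file name as a column before concatenating:

import os

li = []
for file in files:
    month_df = pd.read_csv(file, index_col=0, header=0)
    month_df['source_file'] = os.path.basename(file)   # illustrative column name
    li.append(month_df)

df = pd.concat(li, axis=0, ignore_index=True)   # ignore_index keeps the index continuous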
I previously used the script below to find all csv files in a folder and append them to a dataframe. Now I want to append specified files to a new dataframe.
import glob
import os
import pandas as pd

# define the path for all CSV files
path = r'C:filepath'
csv_files = glob.glob(os.path.join(path, "*.csv"))

li = []
# drop rows with missing data and append each file to the list
for csv in csv_files:
    df = pd.read_csv(csv, index_col=None, header=0)
    df = df.loc[(df['A'].notna()) & (df['B'].notna()) & (df['C'].notna())]
    li.append(df)
What I would like to do is add something like:
file_list = ['name1', 'name2', 'name3']
so that only the files in that list are added to the df.
I think I got it, thanks in large part to gtomer.
for file in file_list:
    try:
        df = pd.read_csv(path + file + '.csv', index_col=None, header=0)
        df = df.loc[(df['A'].notna()) & (df['B'].notna()) & (df['C'].notna())]
        li.append(df)
    except FileNotFoundError:
        # report names from the list that have no matching file
        print(file)
Once you have a list, you can loop through the items in the list and perform your desired action:
file_list = ['name1', 'name2', 'name3']

# the entries need to be paths that pd.read_csv can open
for csv in file_list:
    df = pd.read_csv(csv, index_col=None, header=0)
    df = df.loc[(df['A'].notna()) & (df['B'].notna()) & (df['C'].notna())]
    li.append(df)
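Another option (a sketch, not from either answer above) is to keep the glob from the earlier script and filter the matched paths by base name, so you never have to rebuild the full path by hand:

import os

wanted = {'name1', 'name2', 'name3'}   # file names without the .csv extension

selected = [f for f in csv_files
            if os.path.splitext(os.path.basename(f))[0] in wanted]

li = []
for csv in selected:
    df = pd.read_csv(csv, index_col=None, header=0)
    df = df.loc[(df['A'].notna()) & (df['B'].notna()) & (df['C'].notna())]
    li.append(df)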
I have a folder containing 30 files, each containing thousands of rows. I would like to loop through the files and create a dataframe containing every 10th row from each file. The resulting dataframe would contain rows 10, 20, 30, 40, etc. from the first file; rows 10, 20, 30, 40, etc. from the second file; and so on.
For the moment I have:
import glob
import pandas as pd

all_files = glob.glob("DK_Frequency/*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
which appends the different files from the folder to a list. But I don't know how to go further.
Any idea? Thank you in advance.
This slices every 10th row of each df using iloc and appends it to final_df. At the end of the loop, final_df should contain all the required rows.
all_files = glob.glob("DK_Frequency/*.csv")

final_df = pd.DataFrame()
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    # concatenation returns a new DataFrame, so assign the result back to final_df
    final_df = pd.concat([final_df, df.iloc[::10]], ignore_index=True)
Pandas read_csv lets you keep only every 10th line via skiprows. So you could use:
all_files = glob.glob("DK_Frequency/*.csv")

li = []
for filename in all_files:
    # skiprows keeps only the header (line 0) and every 10th data line
    df = pd.read_csv(filename, index_col=None, header=0, skiprows=lambda x: x % 10 != 0)
    li.append(df)

global_df = pd.concat(li, ignore_index=True)
Assuming that all the csv files have the same structure, you could do as follows:
# -*- coding: utf-8 -*-
all_files = glob.glob("DK_Frequency/*.csv")

# cols_to_take is the list of column headers, taken from the first file
cols_to_take = pd.read_csv(all_files[0]).columns

# create an empty dataframe with those columns
big_df = pd.DataFrame(columns=cols_to_take)

for csv in all_files:
    df = pd.read_csv(csv)
    # keep every 10th row (index 0, 10, 20, ...)
    indices = list(filter(lambda x: x % 10 == 0, df.index))
    df = df.loc[indices].reset_index(drop=True)
    # append df to big_df
    big_df = pd.concat([big_df, df], ignore_index=True)
I have a lot of Excel files and I want to append them using the following code:
import pandas as pd
import glob
import os
import openpyxl

df = []
for f in glob.glob("*.xlsx"):
    data = pd.read_excel(f, 'Sheet1')
    data.index = [os.path.basename(f)] * len(data)
    df.append(data)

df = pd.concat(df)

writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer, 'Sheet1')
writer.save()
The Excel files have this structure (screenshot in the original post):
The output is the following (screenshot):
Why does Python alter the first column when concatenating the Excel files?
I think you need:
df = []
for f in glob.glob("*.xlsx"):
    data = pd.read_excel(f, 'Sheet1')
    name = os.path.basename(f)
    # create a MultiIndex so the original index is not overwritten
    data.index = pd.MultiIndex.from_product([[name], data.index], names=('files', 'orig'))
    df.append(data)

# reset_index turns the MultiIndex levels into columns
df = pd.concat(df).reset_index()
Another solution is to use the keys parameter in concat:
files = glob.glob("*.xlsx")
names = [os.path.basename(f) for f in files]
dfs = [pd.read_excel(f, 'Sheet1') for f in files]
df = pd.concat(dfs, keys=names).rename_axis(('files','orig')).reset_index()
Which is the same as:
df = []
names = []
for f in glob.glob("*.xlsx"):
    df.append(pd.read_excel(f, 'Sheet1'))
    names.append(os.path.basename(f))

df = pd.concat(df, keys=names).rename_axis(('files', 'orig')).reset_index()
Finally, write to Excel with no index and no column names:
# the context manager saves and closes the file automatically
with pd.ExcelWriter('output.xlsx') as writer:
    df.to_excel(writer, 'Sheet1', index=False, header=False)