I previously used the script below to find all csv files in a folder and append them to a dataframe. Now I want to append specified files to a new dataframe.
import glob
import os

import pandas as pd

# define the path for all CSV files
path = r'C:filepath'
csv_files = glob.glob(os.path.join(path, "*.csv"))
li = []

# remove rows with missing data and append each file's frame to the list
for csv in csv_files:
    df = pd.read_csv(csv, index_col=None, header=0)
    df = df.loc[df['A'].notna() & df['B'].notna() & df['C'].notna()]
    li.append(df)
What I would like to do is add something like:
file_list = ['name1', 'name2', 'name3']
so that only the files in the list are appended to the dataframe.
I think I got it, thanks in large part to gtomer.
for file in file_list:
    try:
        # os.path.join avoids a missing separator between path and filename
        df = pd.read_csv(os.path.join(path, file + '.csv'), index_col=None, header=0)
        df = df.loc[df['A'].notna() & df['B'].notna() & df['C'].notna()]
        li.append(df)
    except Exception:
        # report files that could not be read
        print(file)
Once you have a list, you can loop through the items in the list and perform your desired action:
file_list = ['name1', 'name2', 'name3']

for csv in file_list:
    df = pd.read_csv(csv, index_col=None, header=0)
    df = df.loc[df['A'].notna() & df['B'].notna() & df['C'].notna()]
    li.append(df)
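Note that both snippets above only collect the per-file frames in the list li. To end up with a single dataframe you still need one concatenation step at the end; a minimal sketch, assuming li was filled as shown:

import pandas as pd

# combine the per-file frames collected in li into one dataframe
df = pd.concat(li, axis=0, ignore_index=True)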
Related
So I have 366 CSV files and I want to copy their second columns and write them into a new CSV file. I need code for this job. I tried some of the code available here but nothing worked. Please help.
Assuming all the 2nd columns are the same length, you could simply loop through all the files: read each one, save its 2nd column, and build up a new df along the way.
import pandas as pd

filenames = ['test.csv', ....]
new_df = pd.DataFrame()

for filename in filenames:
    df = pd.read_csv(filename)
    second_column = df.iloc[:, 1]
    new_df[f'SECOND_COLUMN_{filename.upper()}'] = second_column
    del df

new_df.to_csv('new_csv.csv', index=False)
Or, collecting the filenames with glob instead of listing them by hand:

import glob

import pandas as pd

filenames = glob.glob(r'D:/CSV_FOLDER' + "/*.csv")
new_df = pd.DataFrame()

for filename in filenames:
    df = pd.read_csv(filename)
    second_column = df.iloc[:, 1]
    new_df[f'SECOND_COLUMN_{filename.upper()}'] = second_column
    del df

new_df.to_csv('new_csv.csv', index=False)
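If the second columns are not all the same length, assigning them into new_df can silently misalign or drop rows. A hedged alternative is to collect the columns in a list and let pd.concat with axis=1 align them, padding shorter columns with NaN; a sketch reusing the filenames list from above:

import pandas as pd

cols = []
for filename in filenames:
    col = pd.read_csv(filename).iloc[:, 1].reset_index(drop=True)
    cols.append(col.rename(f'SECOND_COLUMN_{filename.upper()}'))

# axis=1 aligns on the row index; shorter columns are padded with NaN
new_df = pd.concat(cols, axis=1)
new_df.to_csv('new_csv.csv', index=False)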
This can be accomplished with glob and pandas:
import glob

import pandas as pd

mylist = [f for f in glob.glob("*.csv")]

df = pd.read_csv(mylist[0])       # create the dataframe from the first csv
df = pd.DataFrame(df.iloc[:, 1])  # only keep the 2nd column

for x in mylist[1:]:              # loop through the rest of the csv files doing the same
    t = pd.read_csv(x)
    colName = pd.DataFrame(t.iloc[:, 1]).columns
    df[colName] = pd.DataFrame(t.iloc[:, 1])

df.to_csv('output.csv', index=False)
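One caveat with this approach: df[colName] = ... keys each new column by the header it had in its source file, so if two csvs use the same header for their second column, the later file silently overwrites the earlier one. A sketch of a variant that keys columns by filename instead (it still assumes equal-length columns, like the original):

import glob

import pandas as pd

mylist = glob.glob("*.csv")

df = pd.DataFrame()
for x in mylist:
    # key each column by its source filename to avoid header collisions
    df[x] = pd.read_csv(x).iloc[:, 1]

df.to_csv('output.csv', index=False)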
I have a folder containing 30 files, each of them containing thousands of rows. I would like to loop through the files, creating a dataframe containing every 10th row from each file. The resulting dataframe would contain rows 10, 20, 30, 40, etc. from the first file; rows 10, 20, 30, 40, etc. from the second file; and so on.
For the moment I have:
all_files = glob.glob("DK_Frequency/*.csv")
li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
which appends the individual files from the folder to a list. But I don't know how to go further.
Any ideas? Thank you in advance.
This slices every 10th row out of each df using iloc and appends it to final_df. At the end of the loop, final_df should contain all the necessary rows.
all_files = glob.glob("DK_Frequency/*.csv")
final_df = pd.DataFrame()

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    # append returns a new frame, so the result must be reassigned
    final_df = final_df.append(df.iloc[::10])
Pandas read_csv lets you keep only every 10th line via skiprows, which accepts a callable that is given each file line number and returns True for lines to skip. The lambda below keeps line 0 (the header) plus lines 10, 20, and so on:
all_files = glob.glob("DK_Frequency/*.csv")
li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0,
                     skiprows=lambda x: x % 10 != 0)
    li.append(df)

global_df = pd.concat(li, ignore_index=True)
Assuming that all the csv files have the same structure, you could do as follows:
# -*- coding: utf-8 -*-
import glob

import pandas as pd

all_files = glob.glob("DK_Frequency/*.csv")

# cols_to_take is the list of column headers
cols_to_take = pd.read_csv(all_files[0]).columns

# create an empty dataframe with those columns
big_df = pd.DataFrame(columns=cols_to_take)

for csv in all_files:
    df = pd.read_csv(csv)
    indices = list(filter(lambda x: x % 10 == 0, df.index))
    df = df.loc[indices].reset_index(drop=True)
    # append df to big_df
    big_df = big_df.append(df, ignore_index=True)
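One detail worth noting: the iloc[::10] and x % 10 == 0 answers keep rows 0, 10, 20, ... counting from zero, i.e. the 1st, 11th, 21st rows, while the skiprows answer keeps the 10th, 20th, ... lines of each file as the question asks. To get the latter with iloc, start the slice at offset 9; a minimal sketch combining that with pd.concat:

import glob

import pandas as pd

all_files = glob.glob("DK_Frequency/*.csv")

# iloc[9::10] picks the 10th, 20th, 30th, ... rows counted from 1
parts = [pd.read_csv(f, index_col=None, header=0).iloc[9::10] for f in all_files]
final_df = pd.concat(parts, ignore_index=True)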
I have a very unstructured folder in which a lot of files have no entries: just the header row, with no data inside. I know that I can include them and they will not change anything, but the problem is that the headers are not the same everywhere, so every such file means extra manual work for me.
So far I know how to load all files in a specific folder with the following code:
import glob

import pandas as pd

path = r'C:/Users/...'
all_files = glob.glob(path + "/*.csv")
li = []

for filename in all_files:
    frame = pd.read_csv(filename, index_col=None, header=0, sep=';', encoding='utf-8', low_memory=False)
    li.append(frame)

df = pd.concat(li, axis=0, ignore_index=True, sort=False)
How can I skip every file that only has one row?
Modify this loop from:
for filename in all_files:
    frame = pd.read_csv(filename, index_col=None, header=0, sep=';', encoding='utf-8', low_memory=False)
    li.append(frame)
To:
for filename in all_files:
    frame = pd.read_csv(filename, index_col=None, header=0, sep=';', encoding='utf-8', low_memory=False)
    if len(frame) > 1:
        li.append(frame)
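Since header=0 consumes the header line, a file that contains only the header parses to a frame with zero data rows, so checking for emptiness is the exact test if you also want to keep files with a single data row; a sketch of that variant:

for filename in all_files:
    frame = pd.read_csv(filename, index_col=None, header=0, sep=';', encoding='utf-8', low_memory=False)
    # a header-only file yields zero data rows once header=0 consumes the header
    if not frame.empty:
        li.append(frame)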
That's what if statements are for.
I want to read csv files from a directory and assign each to a different dataframe. I have tried to do so like this:
path = r'C:\Users\A\Documents\Dash'
files = glob.glob(path + "/*.csv")

for file in files:
    f'df{file}' = pd.read_csv(file, sep=',')
But of course I can't assign to a literal, and I don't see another way to do this. I don't really care whether each dataframe is numbered or named after its csv.
You could do it this way:
for index, file in enumerate(files):
    vars()['df' + str(index)] = pd.read_csv(file, sep=',')

print(df0)
print(df1)
files = glob.glob(path + "/*.csv")

## using map & zip:
df_list = list(map(lambda x: pd.read_csv(x, sep=","), files))  # result in a list
df_dict = dict(zip(files, df_list))                            # result in a dict

## using a for loop:
# result in a list
df_list = list()
for file in files:
    df_list.append(pd.read_csv(file, sep=','))

# result in a dict
df_dict = dict()
for file in files:
    df_dict[file] = pd.read_csv(file, sep=',')
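Keying the dict by the bare filename rather than the full glob path can make lookups friendlier; a small sketch, where 'sales.csv' is just a made-up example name:

import glob
import os

import pandas as pd

files = glob.glob(path + "/*.csv")
df_dict = {os.path.basename(f): pd.read_csv(f, sep=',') for f in files}

# look up one frame by its bare filename ('sales.csv' is hypothetical)
df = df_dict['sales.csv']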
I have a for loop that imports all of the Excel files in the directory and merges them into a single dataframe. However, I want to create a new column in which each row holds the filename of the Excel file it came from.
Here is my import and merge code:
import os

import pandas as pd

path = os.getcwd()
files = os.listdir(path)
df = pd.DataFrame()

for f in files:
    data = pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2'])
    df = df.append(data)
For example, if the first Excel file is named "file1.xlsx", I want all rows from that file to have the value file1.xlsx in col3 (a new column). If the second Excel file is named "file2.xlsx", I want all rows from that file to have the value file2.xlsx. Note that there is no real pattern to the Excel file names; I just use these as an example.
Many thanks
Create the new column inside the loop:
df = pd.DataFrame()

for f in files:
    data = pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2'])
    data['col3'] = f
    df = df.append(data)
Another possible solution with list comprehension:
dfs = [pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2']).assign(col3=f)
       for f in files]
df = pd.concat(dfs)
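Both answers store whatever string is in files, so if files holds full paths (for example from glob), col3 will contain the full path. If you only want the bare filename, strip the directory part first; a sketch assuming the same files list:

import os

import pandas as pd

dfs = [pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2'])
         .assign(col3=os.path.basename(f))  # keep just the filename, not the full path
       for f in files]
df = pd.concat(dfs)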