Efficient ways to combine multiple .csv files using python

I currently have about 700 '.csv' files and want to combine them into one. Each file has three columns: 'date', 'name' and 'var'. I have to combine them based on two columns: date and name. I currently read them in as dataframes. After combining them, the final file should have the columns date, name, var1, var2, var3 ... var700. I currently use the pandas merge function, but it is super slow, as the data is large. Is there any efficient way to combine the files? My current code is as follows:
import os
import pandas as pd

for filename in os.listdir(signal_path):
    filepath = os.path.join(signal_path, filename)
    _temp = pd.read_pickle(filepath)
    # merge returns a new frame, so it has to be assigned back
    data = data.merge(_temp, how='left', on=['date', 'name'])
I have attached sample data; each file has a different length.

This code will combine all the CSV files if they are in the same directory.
import pandas as pd
from pathlib import Path

directory = Path("../relevant_directory")
# lazily read every CSV in the directory, then stack them row-wise
df = pd.concat(pd.read_csv(f) for f in directory.glob("*.csv"))
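Note that pd.concat here stacks the files row-wise. To get the wide layout asked for in the question (date, name, var1 ... var700), a pattern that is usually much faster than 700 successive merges is to index each frame on ['date', 'name'] and concatenate once along the columns. A minimal sketch, assuming each file holds the columns date, name and a single value column called var:
import pandas as pd
from pathlib import Path

directory = Path("../relevant_directory")
frames = []
for i, csv_file in enumerate(sorted(directory.glob("*.csv")), start=1):
    frame = pd.read_csv(csv_file)
    # rename the value column so each file contributes var1, var2, ...
    frame = frame.rename(columns={"var": f"var{i}"}).set_index(["date", "name"])
    frames.append(frame)

# one column-wise concat aligns everything on the (date, name) index,
# which is much cheaper than merging file by file
combined = pd.concat(frames, axis=1).reset_index()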

Use this:
import pandas as pd
import os

directory = '../relevant_directory/'
first = True
for folder, subfolders, files in os.walk(directory):
    for f in files:
        # build the full path; str(folder) + str(f) would drop the separator
        file = os.path.join(folder, f)
        if file.split('.')[-1] == 'csv':
            if first:
                data = pd.read_csv(file)
                first = False
            else:
                df = pd.read_csv(file)
                data = pd.merge(data, df, on=['date', 'name'])

Related

To create multiple data frames from multiple excel files using pandas

I have many excel files in a folder.
The task is to read these excel files as individual data frames using a loop and merge them according to a key inserted by the user. I was able to get the names of the files in a list, file_list.
How do I iterate over the list to load the excel files as individual dataframes?
import pandas as pd
import os

os.chdir(r"C:\Users\user\Desktop\RPA")
file_list = []
for root, dirs, files in os.walk(r"C:\Users\user\Desktop\RPA"):
    for file in files:
        if file.endswith('.xlsx'):
            file_list.append(file)
print(file_list)
Once you have the list of file names you can do something like:
df = pd.DataFrame()
for file in file_list:
    this_df = pd.read_excel(file)
    # seed the accumulator with the first file, then merge the rest on "key"
    if len(df) == 0:
        df = this_df
    else:
        df = pd.merge(left=df, right=this_df, how="inner", left_on="key", right_on="key")
Use the glob module for simple recursive traversal and parsing:
import glob
import pandas as pd

files = glob.glob('C:\\Users\\user\\Desktop\\RPA\\**\\*.xlsx', recursive=True)
excel_dfs = []
for f in files:
    df = pd.read_excel(f)
    excel_dfs.append(df)
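The question also asks to merge the frames on a key supplied by the user; one way to finish the job is to chain the merges with functools.reduce. A sketch, assuming the column name the user types in exists in every file:
from functools import reduce

key = input("Column to merge on: ")  # e.g. "Manager"
merged = reduce(lambda left, right: pd.merge(left, right, how="inner", on=key), excel_dfs)
print(merged)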
import pandas as pd
import glob

path = r'C:\Users\user\Desktop\RPA'
d = glob.glob(path + "/*.xlsx")
print(d)

dfs = []
for file in d:
    c = pd.read_excel(file)
    dfs.append(c)

# note: unpacking the list like this only works when exactly two files are found;
# for more files, chain the merges (e.g. with functools.reduce as sketched above)
df1 = pd.merge(*dfs, how="inner", on="Manager")
print(df1)

Loop pandas directory

I have many csv files in a directory, with two columns each:
miRNA read_counts
miR1 10
miR1 5
miR2 2
miR2 3
miR3 100
I would like to sum read_counts if the miRNA id is the same.
Result:
miRNA read_counts
miR1 15
miR2 5
miR3 100
To do that I wrote a little script. However, I don't know how to loop it over all my csv files so I don't have to copy-paste file names and output paths each time. Any help will be very much appreciated. Thanks!
import pandas as pd
df = pd.read_csv("modified_LC1a_miRNA_expressed.csv")
df_new = df.groupby('miRNA')['read_count'].sum()
print(df_new)
df_new.to_csv('sum_LC1a_miRNA_expressed.csv')
Try looking into the glob module.
from glob import glob
import os
import pandas as pd

path = "./your/path"
files = glob(os.path.join(path, "*.csv"))
dataframes = []
for file in files:
    df = pd.read_csv(file)
    # collect each file's frame so they can be combined later
    dataframes.append(df)
Then, use pd.concat to join the dataframes and perform the groupby operation.
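For example, continuing from the loop above (a sketch; it assumes the column is named read_count as in the question's script, and the output file name is just a placeholder):
combined = pd.concat(dataframes, ignore_index=True)
totals = combined.groupby('miRNA')['read_count'].sum()
totals.to_csv('sum_all_miRNA_expressed.csv')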
EDIT 1:
Based on the request mentioned in the comment:
results = {}
for file in files:
    df = pd.read_csv(file)
    # perform the per-file groupby and keep the result keyed by file name
    df_new = df.groupby('miRNA')['read_count'].sum()
    results[file] = df_new
Not trying to steal the answer. I would have put this in a comment under @Asif Ali's answer if I had enough rep.
Assuming all input .csv files follow the format:
"modified_{rest_of_the_file_name}.csv"
And you want the outputs to be:
"sum_{same_rest_of_the_file_name}.csv"
import os
import glob
import pandas as pd

path = "./your/path"
files = glob.glob(os.path.join(path, "*.csv"))
for file in files:
    df = pd.read_csv(file)
    df_new = df.groupby('miRNA')['read_count'].sum()
    print(df_new)
    # swap the "modified" prefix for "sum" in the output file name
    prefix, rest = file.split('modified', 1)
    df_new.to_csv(prefix + 'sum' + rest)

Import multiple excel files and merge into single pandas df with source name as column

I'm trying to merge a bunch of xlsx files into a single pandas dataframe in python. Furthermore, I want to include a column that lists the source file for each row. My code is as follows:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import glob
import os
# get the path for where the xlsx files are
path = os.getcwd()
files = os.listdir(path)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']
# create new dataframe
df = pd.DataFrame()
# read data from files and add into dataframe
for f in files_xlsx:
    data = pd.read_excel(f, 'Sheet 1')
    df['Source_file'] = f
    df = df.append(data)
However, when I look at the 'Source_file' column it lists the final file it read as the name for every row. I've spent way more time than I should trying to fix this. What am I doing wrong?
Within your for loop you are overwriting df on each iteration, so you'll only get back the final file. What you need to do is declare a list beforehand and append to that. Since you imported glob, let's use that as well.
files = glob.glob(os.path.join(os.getcwd(), '*.xlsx'))
dfs = [pd.read_excel(f, sheet_name='Sheet1') for f in files]
df = pd.concat(dfs)
If you want to add the filename into the df too, then:
files = glob.glob(os.path.join(os.getcwd(), '*.xlsx'))
dfs = [pd.read_excel(f, sheet_name='Sheet1') for f in files]
file_names = [os.path.basename(f) for f in files]
df = pd.concat(dfs, keys=file_names)
Using the pathlib module (Python 3.4+):
from pathlib import Path

files = list(Path.cwd().glob('*.xlsx'))
dfs = [pd.read_excel(f, sheet_name='Sheet1') for f in files]
file_names = [f.stem for f in files]
df = pd.concat(dfs, keys=file_names)
Or as a one-liner:
df = pd.concat([pd.read_excel(f) for f in Path.cwd().glob('*.xlsx')],
               keys=[f.stem for f in Path.cwd().glob('*.xlsx')], sort=False)
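With keys=... the file name ends up in the index rather than as a column. If you prefer Source_file as an ordinary column, as the question asks, one variant (a sketch using the same file list):
dfs = [pd.read_excel(f, sheet_name='Sheet1').assign(Source_file=f.name)
       for f in Path.cwd().glob('*.xlsx')]
df = pd.concat(dfs, ignore_index=True)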

Iterating over files in a directory and remove rows from them based on other file

I am looking for a way to iterate over 30 files in a directory and remove rows from them based on IDs in another file. The files contain two columns - an ID and a value - without column names. The other file ("ids_toberemoved") contains just a column with the IDs ("id") that should be removed. After the 30 files are cleaned I want to export them to another folder.
This is what I have so far:
import pandas as pd
import os

ids_toberemoved = pd.read_csv('F:\\ids.csv')
myPath = "F:\\Other"
filesList = []
for path, subdirs, files in os.walk(myPath):
    for name in files:
        filesList.append(os.path.join(path, name))
dataframes = []
for filename in filesList:
    dataframes.append(pd.read_csv(filename))
for df in dataframes:
    df_cleaned = df.merge(ids_toberemoved, left_index=True, right_on=['id'],
                          how='left', indicator=True)
    df_cleaned[df_cleaned._merge != 'both']
I am missing something in the step where I iterate over the data frames and join them with 'ids_toberemoved' in order to delete the rows with the matching IDs. Also, I can't figure out how to store every single file, after the cleaning, in another folder.
Any help appreciated!
Try the following approach:
import pandas as pd
from pathlib import Path

myPath = Path("F:\\Other")
ids_toberemoved = pd.read_csv('F:\\ids.csv', squeeze=True)
res = pd.concat([pd.read_csv(f, header=None, names=["ID", "val"])
                   .query("ID not in @ids_toberemoved")
                 for f in myPath.glob("*.csv")],
                ignore_index=True)
UPDATE: in order to clean the files and to export them separately as "filename_clean.csv":
_ = [pd.read_csv(f, header=None, names=["ID", "val"])
       .query("ID not in @ids_toberemoved")
       .to_csv(f.with_name(f"{f.stem}_clean{f.suffix}"), index=False)
     for f in myPath.glob("*.csv")]
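Since the question asks to export the cleaned files into a different folder, a small variant of the update above (a sketch; the output directory name is just a placeholder):
out_dir = myPath / "cleaned"   # hypothetical output folder
out_dir.mkdir(exist_ok=True)
for f in myPath.glob("*.csv"):
    cleaned = (pd.read_csv(f, header=None, names=["ID", "val"])
                 .query("ID not in @ids_toberemoved"))
    cleaned.to_csv(out_dir / f.name, index=False)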

Reading text files from subfolders and folders and creating a dataframe in pandas for each file text as one observation

I have the following structure of text files in folders and subfolders.
I want to read them all and create a df. I am using this code, but it doesn't work well for me: the text is not what I expected and the number of files does not match my count.
import csv
import glob
import pandas as pd

l = [pd.read_csv(filename, header=None, encoding='iso-8859-1')
     for filename in glob.glob("2018_01_01/*.txt")]
main_df = pd.concat(l, axis=1)
main_df = main_df.T
for i in range(2):
    l = [pd.read_csv(filename, header=None, encoding='iso-8859-1', quoting=csv.QUOTE_NONE)
         for filename in glob.glob(str(foldernames[i+1]) + '/' + '*.txt')]
    df = pd.concat(l, axis=1)
    df = df.T
    main_df = pd.merge(main_df, df)
Assuming those directories contain txt files in which the information has the same structure in all of them:
import os
import pandas as pd

df = pd.DataFrame(columns=['observation'])
path = '/path/to/directory/of/directories/'
for directory in os.listdir(path):
    # build the full path, since os.listdir only returns the names
    dir_path = os.path.join(path, directory)
    if os.path.isdir(dir_path):
        for filename in os.listdir(dir_path):
            with open(os.path.join(dir_path, filename)) as f:
                observation = f.read()
            current_df = pd.DataFrame({'observation': [observation]})
            df = df.append(current_df, ignore_index=True)
Once all your files have been iterated, df should be the DataFrame containing all the information in your different txt files.
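If the text files are nested more than one level deep, a recursive variant with pathlib covers folders and subfolders alike (a sketch, still one observation per file, reusing the encoding from the question):
from pathlib import Path
import pandas as pd

path = Path('/path/to/directory/of/directories/')
observations = [p.read_text(encoding='iso-8859-1') for p in sorted(path.rglob('*.txt'))]
df = pd.DataFrame({'observation': observations})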
You can do that using a for loop, but before that you need to give the files sequential names, like 'fil_0' inside 'fol_0', 'fil_1' inside 'fol_1', 'fil_2' inside 'fol_2' and so on. That makes the for loop straightforward:
import pandas as pd

dataframes = []
for var in range(1000):
    name = "fol_" + str(var) + "/fil_" + str(var) + ".txt"
    dataframes.append(pd.read_csv(name))  # if you need to use all the files at once
    # otherwise, you can use the files one by one:
    # df = pd.read_csv(name)
It will automatically create a dataframe for each file.
