Efficient ways to combine multiple .csv files using python

I currently have about 700 '.csv' files and want to combine them into one. Each file has three columns: 'date', 'name' and 'var'. I have to combine them based on two columns: date and name. I currently read them in as dataframes. After combining them, the final file should have the columns date, name, var1, var2, var3 ... var700. I currently use the pandas merge function, but it is super slow, as the data is large. Is there any efficient way to combine the files? My current code is as follows:
import os
import pandas as pd

for filename in os.listdir(signal_path):
    filepath = os.path.join(signal_path, filename)
    _temp = pd.read_pickle(filepath)
    # merge returns a new frame, so it has to be assigned back
    data = data.merge(_temp, how='left', on=['date', 'name'])
I have attached sample data; each file has a different length.

This code will combine all the CSV files if they are in the same directory.
import pandas as pd
from pathlib import Path

directory = Path("../relevant_directory")
# lazily read every CSV in the directory, then stack them row-wise
df = pd.concat(pd.read_csv(f) for f in directory.glob("*.csv"))
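Note that pd.concat here stacks the files row-wise. To get the wide layout asked for in the question (date, name, var1 ... var700), a pattern that is usually much faster than 700 successive merges is to index each frame on ['date', 'name'] and concatenate once along the columns. A minimal sketch, assuming each file holds the columns date, name and a single value column called var:
import pandas as pd
from pathlib import Path

directory = Path("../relevant_directory")
frames = []
for i, csv_file in enumerate(sorted(directory.glob("*.csv")), start=1):
    frame = pd.read_csv(csv_file)
    # rename the value column so each file contributes var1, var2, ...
    frame = frame.rename(columns={"var": f"var{i}"}).set_index(["date", "name"])
    frames.append(frame)

# one column-wise concat aligns everything on the (date, name) index,
# which is much cheaper than merging file by file
combined = pd.concat(frames, axis=1).reset_index()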

Use this:
import pandas as pd
import os

directory = '../relevant_directory/'
first = True
for folder, subfolders, files in os.walk(directory):
    for f in files:
        # build the full path; str(folder) + str(f) would drop the separator
        file = os.path.join(folder, f)
        if file.split('.')[-1] == 'csv':
            if first:
                data = pd.read_csv(file)
                first = False
            else:
                df = pd.read_csv(file)
                data = pd.merge(data, df, on=['date', 'name'])

Related

To create multiple data frames from multiple excel files using pandas

I have many excel files in a folder.
The task is to read these excel files as individual data frames using a loop and merge them according to a key inserted by the user. I was able to get the names of the files in a list, file_list.
How do I iterate over the list to load the excel files as individual dataframes?
import pandas as pd
import os

os.chdir(r"C:\Users\user\Desktop\RPA")
file_list = []
for root, dirs, files in os.walk(r"C:\Users\user\Desktop\RPA"):
    for file in files:
        if file.endswith('.xlsx'):
            file_list.append(file)
print(file_list)
Once you have the list of file names you can do something like:
df = pd.DataFrame()
for file in file_list:
    this_df = pd.read_excel(file)
    # seed the accumulator with the first file, then merge the rest on "key"
    if len(df) == 0:
        df = this_df
    else:
        df = pd.merge(left=df, right=this_df, how="inner", left_on="key", right_on="key")
Use the glob module for simple recursive traversal and parsing:
import glob
import pandas as pd

files = glob.glob('C:\\Users\\user\\Desktop\\RPA\\**\\*.xlsx', recursive=True)
excel_dfs = []
for f in files:
    df = pd.read_excel(f)
    excel_dfs.append(df)
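The question also asks to merge the frames on a key supplied by the user; one way to finish the job is to chain the merges with functools.reduce. A sketch, assuming the column name the user types in exists in every file:
from functools import reduce

key = input("Column to merge on: ")  # e.g. "Manager"
merged = reduce(lambda left, right: pd.merge(left, right, how="inner", on=key), excel_dfs)
print(merged)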
import pandas as pd
import glob

path = r'C:\Users\user\Desktop\RPA'
d = glob.glob(path + "/*.xlsx")
print(d)

dfs = []
for file in d:
    c = pd.read_excel(file)
    dfs.append(c)

# note: unpacking the list like this only works when exactly two files are found;
# for more files, chain the merges (e.g. with functools.reduce as sketched above)
df1 = pd.merge(*dfs, how="inner", on="Manager")
print(df1)

Loop pandas directory

I have many csv files in a directory, with two columns each:
miRNA read_counts
miR1 10
miR1 5
miR2 2
miR2 3
miR3 100
I would like to sum read_counts if the miRNA id is the same.
Result:
miRNA read_counts
miR1 15
miR2 5
miR3 100
To do that I wrote a little script. However, I don't know how to loop it over all my csv files so I don't have to copy-paste file names and output paths each time. Any help will be very much appreciated. Thanks!
import pandas as pd
df = pd.read_csv("modified_LC1a_miRNA_expressed.csv")
df_new = df.groupby('miRNA')['read_count'].sum()
print(df_new)
df_new.to_csv('sum_LC1a_miRNA_expressed.csv')
Try looking into the glob module.
from glob import glob
import os
import pandas as pd

path = "./your/path"
files = glob(os.path.join(path, "*.csv"))
dataframes = []
for file in files:
    df = pd.read_csv(file)
    # collect each file's frame so they can be combined later
    dataframes.append(df)
Then, use pd.concat to join the dataframes and perform the groupby operation.
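For example, continuing from the loop above (a sketch; it assumes the column is named read_count as in the question's script, and the output file name is just a placeholder):
combined = pd.concat(dataframes, ignore_index=True)
totals = combined.groupby('miRNA')['read_count'].sum()
totals.to_csv('sum_all_miRNA_expressed.csv')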
EDIT 1:
Based on the request mentioned in the comment:
results = {}
for file in files:
    df = pd.read_csv(file)
    # perform the per-file groupby and keep the result keyed by file name
    df_new = df.groupby('miRNA')['read_count'].sum()
    results[file] = df_new
Not trying to steal the answer. I would have put this in a comment under @Asif Ali's answer if I had enough rep.
Assuming all input .csv files follow the format:
"modified_{rest_of_the_file_name}.csv"
And you want the outputs to be:
"sum_{same_rest_of_the_file_name}.csv"
import os
import glob
import pandas as pd

path = "./your/path"
files = glob.glob(os.path.join(path, "*.csv"))
for file in files:
    df = pd.read_csv(file)
    df_new = df.groupby('miRNA')['read_count'].sum()
    print(df_new)
    # swap the "modified" prefix for "sum" in the output file name
    prefix, rest = file.split('modified', 1)
    df_new.to_csv(prefix + 'sum' + rest)

Import multiple excel files and merge into single pandas df with source name as column

I'm trying to merge a bunch of xlsx files into a single pandas dataframe in python. Furthermore, I want to include a column that lists the source file for each row. My code is as follows:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import glob
import os
# get the path for where the xlsx files are
path = os.getcwd()
files = os.listdir(path)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']
# create new dataframe
df = pd.DataFrame()
# read data from files and add into dataframe
for f in files_xlsx:
    data = pd.read_excel(f, 'Sheet 1')
    df['Source_file'] = f
    df = df.append(data)
However, when I look at the 'Source_file' column it lists the final file it read as the name for every row. I've spent way more time than I should trying to fix this. What am I doing wrong?
Within your for loop you are overwriting df on each iteration, so you'll only get back the final file. What you need to do is declare a list beforehand and append to that. Since you imported glob, let's use that as well.
files = glob.glob(os.path.join(os.getcwd(), '*.xlsx'))
dfs = [pd.read_excel(f, sheet_name='Sheet1') for f in files]
df = pd.concat(dfs)
If you want to add the filename into the df too, then:
files = glob.glob(os.path.join(os.getcwd(), '*.xlsx'))
dfs = [pd.read_excel(f, sheet_name='Sheet1') for f in files]
file_names = [os.path.basename(f) for f in files]
df = pd.concat(dfs, keys=file_names)
Using the pathlib module (Python 3.4+):
from pathlib import Path

files = list(Path.cwd().glob('*.xlsx'))
dfs = [pd.read_excel(f, sheet_name='Sheet1') for f in files]
file_names = [f.stem for f in files]
df = pd.concat(dfs, keys=file_names)
Or as a one-liner:
df = pd.concat([pd.read_excel(f) for f in Path.cwd().glob('*.xlsx')],
               keys=[f.stem for f in Path.cwd().glob('*.xlsx')], sort=False)
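With keys=... the file name ends up in the index rather than as a column. If you prefer Source_file as an ordinary column, as the question asks, one variant (a sketch using the same file list):
dfs = [pd.read_excel(f, sheet_name='Sheet1').assign(Source_file=f.name)
       for f in Path.cwd().glob('*.xlsx')]
df = pd.concat(dfs, ignore_index=True)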

Iterating over files in a directory and remove rows from them based on other file

I am looking for a way to iterate over 30 files in a directory and remove rows from them based on IDs in another file. The files contain two columns - an ID and a value - without column names. The other file ("ids_toberemoved") contains just a column with the IDs ("id") that should be removed. After the 30 files are cleaned I want to export them to another folder.
This is what I have so far:
import pandas as pd
import os

ids_toberemoved = pd.read_csv('F:\\ids.csv')
myPath = "F:\\Other"
filesList = []
for path, subdirs, files in os.walk(myPath):
    for name in files:
        filesList.append(os.path.join(path, name))
dataframes = []
for filename in filesList:
    dataframes.append(pd.read_csv(filename))
for df in dataframes:
    df_cleaned = df.merge(ids_toberemoved, left_index=True, right_on=['id'],
                          how='left', indicator=True)
    df_cleaned[df_cleaned._merge != 'both']
I am missing something in the step where I iterate over the data frames and join them with 'ids_toberemoved' in order to delete the rows with the matching IDs. Also, I can't figure out how to store every single file, after the cleaning, in another folder.
Any help appreciated!
Try the following approach:
import pandas as pd
from pathlib import Path

myPath = Path("F:\\Other")
ids_toberemoved = pd.read_csv('F:\\ids.csv', squeeze=True)
res = pd.concat([pd.read_csv(f, header=None, names=["ID", "val"])
                   .query("ID not in @ids_toberemoved")
                 for f in myPath.glob("*.csv")],
                ignore_index=True)
UPDATE: in order to clean the files and to export them separately as "filename_clean.csv":
_ = [pd.read_csv(f, header=None, names=["ID", "val"])
       .query("ID not in @ids_toberemoved")
       .to_csv(f.with_name(f"{f.stem}_clean{f.suffix}"), index=False)
     for f in myPath.glob("*.csv")]
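Since the question asks to export the cleaned files into a different folder, a small variant of the update above (a sketch; the output directory name is just a placeholder):
out_dir = myPath / "cleaned"   # hypothetical output folder
out_dir.mkdir(exist_ok=True)
for f in myPath.glob("*.csv"):
    cleaned = (pd.read_csv(f, header=None, names=["ID", "val"])
                 .query("ID not in @ids_toberemoved"))
    cleaned.to_csv(out_dir / f.name, index=False)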

Reading text files from subfolders and folders and creating a dataframe in pandas for each file text as one observation

I have the following structure of text files in folders and subfolders.
I want to read them all and create a df. I am using this code, but it doesn't work well for me: the text is not what I expected and the number of files does not match my count.
import csv
import glob
import pandas as pd

l = [pd.read_csv(filename, header=None, encoding='iso-8859-1')
     for filename in glob.glob("2018_01_01/*.txt")]
main_df = pd.concat(l, axis=1)
main_df = main_df.T
for i in range(2):
    l = [pd.read_csv(filename, header=None, encoding='iso-8859-1', quoting=csv.QUOTE_NONE)
         for filename in glob.glob(str(foldernames[i+1]) + '/' + '*.txt')]
    df = pd.concat(l, axis=1)
    df = df.T
    main_df = pd.merge(main_df, df)
Assuming those directories contain txt files in which the information has the same structure in all of them:
import os
import pandas as pd

df = pd.DataFrame(columns=['observation'])
path = '/path/to/directory/of/directories/'
for directory in os.listdir(path):
    # build the full path, since os.listdir only returns the names
    dir_path = os.path.join(path, directory)
    if os.path.isdir(dir_path):
        for filename in os.listdir(dir_path):
            with open(os.path.join(dir_path, filename)) as f:
                observation = f.read()
            current_df = pd.DataFrame({'observation': [observation]})
            df = df.append(current_df, ignore_index=True)
Once all your files have been iterated, df should be the DataFrame containing all the information in your different txt files.
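If the text files are nested more than one level deep, a recursive variant with pathlib covers folders and subfolders alike (a sketch, still one observation per file, reusing the encoding from the question):
from pathlib import Path
import pandas as pd

path = Path('/path/to/directory/of/directories/')
observations = [p.read_text(encoding='iso-8859-1') for p in sorted(path.rglob('*.txt'))]
df = pd.DataFrame({'observation': observations})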
You can do that using a for loop, but before that you need to give the files sequential names, like 'fil_0' inside 'fol_0', 'fil_1' inside 'fol_1', 'fil_2' inside 'fol_2' and so on. That makes the for loop straightforward:
import pandas as pd

dataframes = []
for var in range(1000):
    name = "fol_" + str(var) + "/fil_" + str(var) + ".txt"
    dataframes.append(pd.read_csv(name))  # if you need to use all the files at once
    # otherwise, you can use the files one by one:
    # df = pd.read_csv(name)
It will automatically create a dataframe for each file.
