How to import multiple text files into one dataframe in Python?

I found here how to import multiple text files into one data frame. However, it gives an error. The files are named footballseason1, footballseason2, footballseason3, ... (up to footballseason5000).
import pandas as pd
import datetime as dt
import os, glob
os.chdir("~/Downloads/data")
filenames = [i for i in glob.glob("*.txt")]
FileNotFoundError: [Errno 2] No such file or directory: '~/Downloads/data'
However, if I try to import one file, everything is working and the directory is found
df = pd.read_csv("~/Downloads/data/footballseason1.txt", sep=",")
Could you help fix the problem? And is there a way to do all the steps using the path where the files are located, without changing the working directory?

Python's os does not understand ~ by default, so it needs to be expanded manually:
filenames = [i for i in glob.glob(os.path.expanduser("~/Downloads/data/*.txt"))]
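Since the files run from footballseason1 to footballseason5000, it may also matter that glob returns names in arbitrary (often lexicographic) order, where footballseason10 sorts before footballseason2. A minimal sketch of reading them in numeric order into one dataframe; the temporary folder and tiny sample files here are stand-ins for ~/Downloads/data, not part of the original question:

```python
import glob
import os
import re
import tempfile

import pandas as pd

# Stand-in for ~/Downloads/data: a temporary folder with a few sample files.
data_dir = tempfile.mkdtemp()
for i in (1, 2, 10):
    with open(os.path.join(data_dir, "footballseason%d.txt" % i), "w") as fh:
        fh.write("team,points\nA,%d\n" % i)

# Sort on the number embedded in the file name, not on the raw string,
# so footballseason10 comes after footballseason2.
filenames = sorted(
    glob.glob(os.path.join(data_dir, "footballseason*.txt")),
    key=lambda p: int(re.search(r"footballseason(\d+)\.txt$", p).group(1)),
)

df = pd.concat((pd.read_csv(f) for f in filenames), ignore_index=True)
```

With real data the only change needed is replacing `data_dir` with `os.path.expanduser("~/Downloads/data")`.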

You can use Python's list comprehension and pd.concat as below (glob does not expand ~ either, so it has to be expanded here too):
df = pd.concat([pd.read_csv(i, sep=',') for i in glob.glob(os.path.expanduser("~/Downloads/data/*.txt"))])

Via pathlib ->
import pandas as pd
from pathlib import Path
# Path does not expand ~ by itself either, hence .expanduser()
inp_path = Path("~/Downloads/data").expanduser()
df = pd.concat([
    pd.read_csv(txt_file, sep=',') for txt_file in inp_path.glob('*.txt')
])
With added check ->
import pandas as pd
from pathlib import Path
inp_path = Path("~/Downloads/data").expanduser()
if inp_path.exists():
    df = pd.concat([
        pd.read_csv(txt_file, sep=',') for txt_file in inp_path.glob('*.txt')
    ])
else:
    print("input dir doesn't exist, please check the path")

Importing Data from Multiple files
Now let's see how we can import data from multiple files in a specific directory. There are many ways to do so, but I personally believe this is an easy and simple way to use and to understand, especially for beginners.
1) First, we are going to import the os and glob libraries. We need them to navigate through different working directories and get their paths.
import os
import glob
2) We also need to import the pandas library as we need to work with data frames.
import pandas as pd
3) Let’s change our working directory to the directory where we have all the data files.
os.chdir(r"C:\Users\HARISH\Path_for_our_files")
4) Now we need a list comprehension that collects all the .csv files in the current working directory.
filenames = [i for i in glob.glob("*.csv")]
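The four steps above stop at collecting the file names; a possible fifth step is reading each file and stacking the results into a single dataframe. A minimal runnable sketch, where the temporary folder and two sample files stand in for the directory used in step 3:

```python
import glob
import os
import tempfile

import pandas as pd

# Stand-in for the working directory chosen in step 3.
os.chdir(tempfile.mkdtemp())
for name in ("a.csv", "b.csv"):
    with open(name, "w") as fh:
        fh.write("x,y\n1,2\n")

filenames = [i for i in glob.glob("*.csv")]

# 5) Read every file and stack the results into a single dataframe.
combined = pd.concat((pd.read_csv(f) for f in filenames), ignore_index=True)
```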

Related

Export Glob list to CSV

I have a folder named "Photos" that contains several images. I am using Glob to list all these images along with their full directory paths. I can print the list and see the full list of paths, however, I am now struggling to export this list into a CSV with a single column. My code is as follows:
Import glob
for file in glob.glob(r"C:\Users\myself\Photos*"):
print(file)
Normally I would use pandas to read CSVs into a dataframe, but I am struggling to do the same for a glob list.
I'd appreciate any guidance or help.
You're close. Use this:
import glob
import pandas as pd
list_of_pictures = []
for file in glob.glob(r"C:\Users\myself\Photos\*"):
    list_of_pictures.append(file)
pd.DataFrame(list_of_pictures).to_csv(r'path&name_of_your_csvfile.csv', index=False, header=False)
Or with pathlib:
from pathlib import Path
import pandas as pd
list_of_pictures = []
for file in Path(r'C:\Users\myself\Photos').glob('**/*'):
    list_of_pictures.append(str(file.absolute()))
pd.DataFrame(list_of_pictures).to_csv(r'path&name_of_your_csvfile.csv', index=False, header=False)
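Both snippets write the paths as a single unnamed column. If a header row is wanted instead, the list can be passed as a named column; a minimal sketch where a temporary folder stands in for the Photos directory and the column name "path" is just an example:

```python
import os
import tempfile

import pandas as pd

# Stand-in for the Photos folder, holding two dummy image files.
photos = tempfile.mkdtemp()
for name in ("a.jpg", "b.jpg"):
    open(os.path.join(photos, name), "w").close()

list_of_pictures = sorted(
    os.path.join(photos, f) for f in os.listdir(photos)
)

# Naming the column ("path" is only an example) makes to_csv write a header.
out_csv = os.path.join(photos, "pictures.csv")
pd.DataFrame({"path": list_of_pictures}).to_csv(out_csv, index=False)
```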

Using Python/Pandas to loop over several csv files in a folder and make changes to each csv file

I currently have several csv files in a folder. I am wanting to use Python to loop over the files in the folder and make small changes to each csv file. Please see my code below which is not currently working:
import os
import pandas as pd
folder_to_view = "C:/path"
for file in os.listdir(folder_to_view):
    df = pd.read_csv(file)
    df.columns = ['Location','Subscriber','Speed','IP','Start','End','Bytes','Test Status','Comment']
    df.to_csv(file, index=False)
I imagine the issue is that the path is not being formed correctly, since the renaming of the columns should be fine. os.listdir() returns a list of the file names within the directory without the directory name prepended, so try this:
import os
import pandas as pd
folder_to_view = "C:/path"
for file in os.listdir(folder_to_view):
    full_path = f'{folder_to_view}/{file}'
    df = pd.read_csv(full_path)
    df.columns = ['Location','Subscriber','Speed','IP','Start','End','Bytes','Test Status','Comment']
    df.to_csv(full_path, index=False)
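The same loop can be written with pathlib, which handles the path joining implicitly. A minimal sketch, assuming a temporary folder as a stand-in for "C:/path"; the sample file has only two columns, so the column list is shortened to two hypothetical names:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for "C:/path": a temp folder with one two-column sample csv.
folder_to_view = Path(tempfile.mkdtemp())
(folder_to_view / "sample.csv").write_text("a,b\n1,2\n")

new_columns = ["Location", "Subscriber"]  # shortened for the two-column sample

for csv_path in folder_to_view.glob("*.csv"):
    df = pd.read_csv(csv_path)      # Path objects work directly with pandas
    df.columns = new_columns
    df.to_csv(csv_path, index=False)
```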

Sort files inside folder by element

I'm working with a Python script that takes some CSV files inside a folder and merges the data inside these files, but I have a problem when sorting the files.
I found a similar useful question and tried its answers, but they didn't work.
I can obtain the final file, but the sort doesn't work as I expect: I am sorting on the numeric element in the name of each file.
How can I resolve this issue?
My code is the following:
import pandas as pd
import os
import glob
import numpy as np
import re
from os import listdir
#files = glob.glob1('./separa_0-60/', '*' + '.csv')
# if you want sort files according to the digits included in the filename, you can do as following:
#data_files = sorted(files, key=lambda x:float(re.findall("(\d+)",x)[0]))
#data_files = sorted(glob.glob('./separa_0-60/resultados_nodos_*.csv'))
data_files = sorted(glob.glob('./separa_0-60/resultados_nodos_*.csv'), key=lambda x: float(re.findall(r"(\d+)", x)[0]))
#print(files)
print(data_files)
mergeddata = pd.concat(pd.read_csv(datafile, sep=';')
                       for datafile in data_files)
keep_col = [
    "node_code",
    "throughput[Mbps]",
    "node_code.1",
    "throughput[Mbps].1"
]
mergeddata2 = mergeddata[keep_col]
print(mergeddata2)
mergeddata2.to_csv('resul_nodos_final_separa0-60.csv', index=False)
I very much appreciate all the help, regards!
The problem is that the directory name "separa_0-60" has digits in it, so the first result from your findall is that "0". Better to do a more specific search on the file name:
data_files = sorted(glob.glob('./separa_0-60/resultados_nodos_*.csv'),
                    key=lambda x: float(re.search(r"resultados_nodos_(\d+)\.csv$", x).group(1)))
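The effect of the corrected key can be seen on a hypothetical sample of the file listing:

```python
import re

# Hypothetical sample of the real file listing.
files = [
    "./separa_0-60/resultados_nodos_10.csv",
    "./separa_0-60/resultados_nodos_2.csv",
    "./separa_0-60/resultados_nodos_1.csv",
]

# findall(r"(\d+)", x)[0] would grab the "0" from "separa_0-60" for every
# name; anchoring the search on the file name extracts the real number.
ordered = sorted(
    files,
    key=lambda x: float(re.search(r"resultados_nodos_(\d+)\.csv$", x).group(1)),
)
```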

How to import all csv files in one folder and make the filename the variable name in pandas?

I would like to automatically import all csv files that are in one folder as dataframes and set the dataframe's variable name to the respective filename.
For example, in the folder are the following three files: data1.csv, data2.csv and data3.csv
How can I automatically import all three files having three dataframes (data1, data2 and data3) as the result?
If you want to save each dataframe as a variable named after its own file name, you can use exec. But this is not secure: it could cause code injection.
import pandas
import os
path = "path_of_directory"
files = os.listdir(path)  # Returns the list of files in the folder at the specified path
for file in files:
    if file.endswith(".csv"):  # Check whether the file ends with .csv
        # os.sep is the path separator of the operating system
        exec(f"{file[:-4]} = pandas.read_csv(r'{path}{os.sep}{file}')")
You can loop over the directory using pathlib and build a dictionary of name -> DataFrame, e.g.:
import pathlib
import pandas as pd
dfs = {path.stem: pd.read_csv(path) for path in pathlib.Path('thepath/').glob('*.csv')}
Then access them as dfs['data1'] etc.
Since the answer that was given includes an exec command, and munir.aygun already warned what could go wrong with that approach, I now want to show you the way Justin Ezequiel and munir.aygun already suggested:
import os
import glob
import pandas as pd
# Path to your data
path = r'D:\This\is\your\path'
# Get all .csv files at your path
allFiles = glob.glob(path + "/*.csv")
# Read in the data from the files and save it to a dictionary
dataStorage = {}
for filename in allFiles:
    name = os.path.basename(filename).split(".")[0]
    dataStorage[name] = pd.read_csv(filename)
# Can then be used like this (for printing here)
if "data1" in dataStorage:
    print(dataStorage["data1"])
Hope this can still be helpful.

Read in all csv files from a directory using Python

I hope this is not trivial but I am wondering the following:
If I have a specific folder with n csv files, how could I iteratively read all of them, one at a time, and perform some calculations on their values?
For a single file, for example, I do something like this and perform some calculations on the x array:
import csv
import os
import numpy
directoryPath = raw_input('Directory path for native csv file: ')
csvfile = numpy.genfromtxt(directoryPath, delimiter=",")
x = csvfile[:, 2]  # Creates the array that will undergo a set of calculations
I know that I can check how many csv files there are in a given folder (check here):
import glob
for files in glob.glob("*.csv"):
    print files
But I failed to figure out how to possibly nest the numpy.genfromtxt() function in a for loop, so that I read in all the csv files of a directory that it is up to me to specify.
EDIT
The folder I have only has jpg and csv files. The latter are named eventX.csv, where X ranges from 1 to 50. The for loop I am referring to should therefore consider the file names the way they are.
That's how I'd do it:
import os
directory = os.path.join("c:\\", "path")
for root, dirs, files in os.walk(directory):
    for file in files:
        if file.endswith(".csv"):
            # join with root, since the names in files have no directory part
            f = open(os.path.join(root, file), 'r')
            # perform calculation
            f.close()
Using pandas and glob as the base packages:
import glob
import pandas as pd
glued_data = pd.DataFrame()
for file_name in glob.glob(directoryPath + '*.csv'):
    x = pd.read_csv(file_name, low_memory=False)
    glued_data = pd.concat([glued_data, x], axis=0)
I think you are looking for something like this:
import glob
for file_name in glob.glob(directoryPath + '*.csv'):
    x = np.genfromtxt(file_name, delimiter=',')[:, 2]
    # do your calculations
Edit
If you want to get all csv files from a folder (including subfolders) you could use subprocess instead of glob (note that this code only works on Linux systems):
import subprocess
file_list = subprocess.check_output(['find', directoryPath, '-name', '*.csv']).decode().split('\n')[:-1]
for i, file_name in enumerate(file_list):
    x = np.genfromtxt(file_name, delimiter=',')[:, 2]
    # do your calculations
    # now you can use i as an index
It first searches the folder and sub-folders for all file_names using the find command from the shell and applies your calculations afterwards.
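Shelling out to find is not strictly necessary for recursion: glob itself can search subfolders portably with a `**` pattern and recursive=True. A minimal sketch, where a temporary directory tree stands in for the real folder:

```python
import glob
import os
import tempfile

# Stand-in tree: one csv at the top level and one in a subfolder.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
for rel in ("top.csv", os.path.join("sub", "nested.csv")):
    open(os.path.join(root, rel), "w").close()

# '**' with recursive=True descends into subfolders on any OS,
# so no Linux-only `find` call is needed.
file_list = sorted(glob.glob(os.path.join(root, "**", "*.csv"), recursive=True))
```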
According to the documentation of numpy.genfromtxt(), the first argument can be a
File, filename, or generator to read.
That would mean that you could write a generator that yields the lines of all the files like this:
import glob

def csv_merge_generator(pattern):
    for name in glob.glob(pattern):
        # open each matched file and yield its lines
        with open(name) as f:
            for line in f:
                yield line

# then using it like this
numpy.genfromtxt(csv_merge_generator('*.csv'))
should work. (I do not have numpy installed, so cannot test easily)
Here's a more succinct way to do this, given some path = "/path/to/dir/".
import glob
import pandas as pd
pd.concat([pd.read_csv(f) for f in glob.glob(path+'*.csv')])
Then you can apply your calculation to the whole dataset, or, if you want to apply it one by one:
pd.concat([process(pd.read_csv(f)) for f in glob.glob(path+'*.csv')])
The function below will return a dictionary containing a dataframe for each .csv file in the folder within your defined path.
import pandas as pd
import glob
import os
import ntpath

def panda_read_csv(path):
    pd_csv_dict = {}
    csv_files = glob.glob(os.path.join(path, "*.csv"))
    for csv_file in csv_files:
        file_name = ntpath.basename(csv_file)
        pd_csv_dict['pd_' + file_name] = pd.read_csv(csv_file, sep=";", encoding='mac_roman')
    return pd_csv_dict
You can use pathlib glob functionality to list all .csv in a path, and pandas to read them.
Then it's only a matter of applying whatever function you want (which, if systematic, can also be done within the list comprehension)
import pandas as pd
from pathlib import Path
path2csv = Path("/your/path/")
csvlist = path2csv.glob("*.csv")
csvs = [pd.read_csv(g) for g in csvlist]
Another answer using list comprehension:
from os import listdir
files = [f for f in listdir("./") if f.endswith(".csv")]
You need to import the glob library and then use it like following:
import glob
path = 'C:\\Users\\Admin\\PycharmProjects\\db_conection_screenshot\\seclectors_absent_images'
filenames = glob.glob(path + "\\*.png")
print(len(filenames))
