Attempting to read data from multiple files to multiple arrays - python

I would like to be able to read data from multiple files in one folder into multiple arrays and then perform analysis on those arrays, such as plotting graphs. I am currently having trouble reading the data from these files into multiple arrays.
My solution process so far is as follows:
import numpy as np
import os

# Create an empty list to read filenames into
filenames = []
for file in os.listdir('C:\\folderwherefileslive'):
    filenames.append(file)
This works so far. What I'd like to do next is iterate over the filenames in the list using numpy.genfromtxt.
I'm trying to use os.path.join to put each individual list entry at the end of the path specified in listdir earlier. This is some example code:
for i in filenames:
    file_name = os.path.join('C:\\entryfromabove', 'i')
    'data_'+[i] = np.genfromtxt('file_name', skiprows=2, delimiter=',')
This piece of code returns "Invalid syntax".
To sum up the solution process I'm trying to use so far:
1. Use os.listdir to get all the filenames in the folder I'm looking at.
2. Use os.path.join to direct np.genfromtxt to open and read data from each file to a numpy array named after that file.
I'm not experienced with python by any means - any tips or questions on what I'm trying to achieve are welcome.

For this kind of task you'd want to use a dictionary.
data = {}
folder = 'C:\\folderwherefileslive'
for file in os.listdir(folder):
    path = os.path.join(folder, file)
    # skip_header is the current genfromtxt name for the old skiprows argument
    data[file] = np.genfromtxt(path, skip_header=2, delimiter=',')
# now you could for example access
data['foo.txt']
Notice that everything you put within single or double quotes ends up being a character string, so 'file_name' is just those characters, whereas file_name (unquoted) uses the value stored in the variable by that name.
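A two-line illustration of that difference, with a hypothetical variable:
file_name = "data.csv"
print('file_name')  # prints the literal text: file_name
print(file_name)    # prints the variable's value: data.csv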

Related

Modifying the order of reading CSV Files in Python according to the name

I have 1000 CSV files with names Radius_x, where x stands for 0, 1, 2 ... 11, 12, 13 ... 999. But when I read the files and try to analyse the results, I want them read in the natural whole-number order listed above. Instead the code reads them as follows (for example): ...145, 146, 147, 148, 149, 15, 150, 151 ... 159, 16, 160, 161 ... and so on.
I know that if I rename the CSV files as Radius_xyz, where xyz = 000, 001, 002, 003 ... 010, 011, 012 ... 999, the problem would be resolved. Kindly help me with how I can proceed.
To sort a list of paths numerically in python, first find all the files you are looking to open, then sort that iterable with a key which extracts the number.
With pathlib:
from pathlib import Path
files = list(Path("/tmp/so/").glob("Radius_*.csv")) # Path.glob returns a generator which needs to be put in a list
files.sort(key=lambda p: int(p.stem[7:])) # `Radius_` length is 7
files now contains:
[PosixPath('/tmp/so/Radius_1.csv'),
PosixPath('/tmp/so/Radius_2.csv'),
PosixPath('/tmp/so/Radius_3.csv'),
PosixPath('/tmp/so/Radius_4.csv'),
PosixPath('/tmp/so/Radius_5.csv'),
PosixPath('/tmp/so/Radius_6.csv'),
PosixPath('/tmp/so/Radius_7.csv'),
PosixPath('/tmp/so/Radius_8.csv'),
PosixPath('/tmp/so/Radius_9.csv'),
PosixPath('/tmp/so/Radius_10.csv'),
PosixPath('/tmp/so/Radius_11.csv'),
PosixPath('/tmp/so/Radius_12.csv'),
PosixPath('/tmp/so/Radius_13.csv'),
PosixPath('/tmp/so/Radius_14.csv'),
PosixPath('/tmp/so/Radius_15.csv'),
PosixPath('/tmp/so/Radius_16.csv'),
PosixPath('/tmp/so/Radius_17.csv'),
PosixPath('/tmp/so/Radius_18.csv'),
PosixPath('/tmp/so/Radius_19.csv'),
PosixPath('/tmp/so/Radius_20.csv')]
NB: files is a list of paths, not strings, but most functions that deal with files accept both types.
A similar approach is possible with glob, which yields a list of strings rather than paths; see the sketch below.
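For reference, a minimal glob-based sketch of the same numeric sort, extracting the number with a regular expression (directory carried over from above):
import glob
import re

files = glob.glob("/tmp/so/Radius_*.csv")
files.sort(key=lambda s: int(re.search(r"Radius_(\d+)\.csv$", s).group(1)))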

Finding filenames in a folder that contain a variable from a list, opening the JSON files and performing actions on them

I'm working with JSON filetypes and I've created some code that will open a single file and add it to a pandas dataframe, performing some procedures on the data within. A snippet of this code follows:
response_dic=first_response.json()
print(response_dic)
base_df=pd.DataFrame(response_dic)
base_df.head()
The code then goes on to extract parts of the JSON data into dataframes, before merging and printing to CSV.
Where I want to develop the code, is to have it iterate through a folder first, find filenames that match my list of filenames that I want to work on and then perform the functions on those filenames. For example, I have a folder with 1000 docs, I will only need to perform the function on a sample of these.
I've created a CSV list of the account codes that I want to work on; I've then imported the CSV details and created a list of account codes as follows:
import csv

csv_file = open(r'C:\filepath', 'r')
cikas = []
cikbs = []
csv_file.readline()  # skip the header line
for a, b, c in csv.reader(csv_file, delimiter=','):
    cikas.append(a)
    cikbs.append(b)
midstring = [s for s in cikbs]
print(midstring)
My account names are then stored in midstring, for example ['12345', '2468', '56789']. This means I can control which account codes are worked on by amending my CSV file in future. These names will vary at different stages hence I don't want to absolutely define them at this stage.
What I would like the code to do, is check the working directory, see if there is a file that matches for example C:\Users*12345.json. If there is, perform the pandas procedures upon it, then move to the next file. Is this possible? I've tried a number of tutorials involving glob, iglob, fnmatch etc but struggling to come up with a workable solution.
You can list all the files with a .json extension in the current directory first.
import os, json
import pandas as pd
path_to_json = 'currentdir/'
json_files = [json_file for json_file in os.listdir(path_to_json) if json_file.endswith('.json')]
print(json_files)
Now iterate over the list of json_files and perform a check:
# example list: json_files = ['12345.json', '2468.json', '56789.json']
# midstring = ['12345', '2468', '56789']
for file in json_files:
    if file.split('.')[0] in midstring:
        # open the matching file and build a dataframe from its JSON content
        with open(os.path.join(path_to_json, file)) as f:
            df = pd.DataFrame(json.load(f))
        # perform pandas functions
    else:
        continue
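An alternative sketch with glob, matching the C:\Users*12345.json style of pattern from the question (the folder and pattern here are assumptions, not from the original code):
import glob
import os

for code in midstring:
    # one wildcard search per account code; process each hit with the pandas steps
    for path in glob.glob(os.path.join(r'C:\Users', f'*{code}.json')):
        print(path)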

Searching for filenames containing multiple keywords in python

I have a directory containing a large number of files. I want to find all files, where the file name contains specific strings (e.g. a certain ending like '.txt', a model ID 'model_xy', etc.) as well as one of the entries in an integer array (e.g. a number of years I would like to select).
I tried this the following way:
import numpy as np
import glob
startyear = 2000
endyear = 2005
timerange = str(np.arange(startyear,endyear+1))
data_files = []
for file in glob.glob('/home/..../*model_xy*' + timerange + '.txt'):
    data_files.append(file)
print(data_files)
Unfortunately, like this, other files outside of my 'timerange' are still selected.
glob.glob does not accept regular expressions, but it does support shell-style wildcards, including character ranges such as [0-5]. Moreover, glob.glob returns a list, so you don't need to iterate through it and append to a new list.
import glob

data_files = glob.glob("/home/..../*model_xy*200[0-5].txt")
# or, to search recursively under /home/, use ** together with recursive=True
data_files = glob.glob("/home/**/*model_xy*200[0-5].txt", recursive=True)
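If the year range does not fit a single character class (say 1998 to 2005), one alternative is to glob broadly and filter with a regular expression; a sketch, assuming the year always comes right before .txt:
import glob
import re

startyear, endyear = 1998, 2005
pattern = re.compile(r'(\d{4})\.txt$')

data_files = []
for path in glob.glob('/home/..../*model_xy*.txt'):
    m = pattern.search(path)
    if m and startyear <= int(m.group(1)) <= endyear:
        data_files.append(path)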

How to import "dat" file including only certain strings in python

I am trying to import several "dat" files into Python (Spyder).
The dat files are listed in a screenshot; there are two types, ending with _1 and _2. I want to import only the dat files ending with "_1".
Is there any way to import them at once with one single loop?
After I import them, I would like to aggregate them all into one single matrix.
import os

files_to_import = [f for f in os.listdir(folder_path)
                   if f.endswith("1")]
Make sure you know whether the files have a .dat extension or not: in Windows Explorer, the default setting hides file extensions, and this will make your code fail if the files end differently than you expect.
What this code does is called a list comprehension: os.listdir() provides all the files in the folder, and you create a list containing only the ones that end with "1". Written as a plain loop, it is equivalent to the sketch below.
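The equivalent explicit loop (folder_path as above):
import os

files_to_import = []
for f in os.listdir(folder_path):
    if f.endswith("1"):
        files_to_import.append(f)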
str.endswith() returns True if the string ends with the given suffix.
Per the Python documentation, the syntax is:
str.endswith(suffix[, start[, end]])
For your case, you will need a loop that takes each filename as a string and checks whether it ends with "_1":
yourFilename = "yourfilename_1"
if yourFilename.endswith("_1"):
    # do your job here
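To also cover the aggregation step, a minimal sketch, assuming the files are plain numeric text that np.loadtxt can parse and that folder_path names the directory (both assumptions, not from the question):
import os
import numpy as np

arrays = []
for f in sorted(os.listdir(folder_path)):
    if f.endswith("_1.dat"):  # keep only the files ending in _1
        arrays.append(np.loadtxt(os.path.join(folder_path, f)))

# stack the per-file arrays row-wise into one matrix
matrix = np.vstack(arrays)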

how to read multiple csv files in a directory through python csv() function?

In one of my directories, I have multiple CSV files. I wanted to read the content of all the CSV files through a Python code and print the data, but so far I have not been able to do so.
All the CSV files have the same number of columns and the same column names as well.
I know a way to list all the CSV files in the directory and iterate over them through the "os" module and a "for" loop.
for files in os.listdir("C:\\Users\\AmiteshSahay\\Desktop\\test_csv"):
Now use the "csv" module to read the files name
reader = csv.reader(files)
Up to this point I expect the output to be the names of the CSV files, which happen to be sorted; for example, the names are 1.csv, 2.csv and so on. But the output is as below:
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
If I add a next() call after csv.reader(), I get the output below:
['1']
['2']
['3']
['4']
['5']
['6']
These happen to be the first characters of my CSV file names, which is partially correct but not fully.
Apart from this, once I have the files iterated, how do I see the contents of the CSV files on the screen? Today I have 6 files. Later on, I could have 100 files. So it's not possible to use the file handling method in my scenario.
Any suggestions?
The easiest way I found while developing my project is to use dataframes, read_csv, and glob.
import glob
import pandas as pd

folder_name = 'train_dataset'
file_type = 'csv'
separator = ','

# read every CSV in the folder and concatenate them into one big dataframe
dataframe = pd.concat(
    [pd.read_csv(f, sep=separator) for f in glob.glob(folder_name + "/*." + file_type)],
    ignore_index=True)
Here, all the csv files are loaded into 1 big dataframe.
I would recommend reading your CSVs using the pandas library.
Check this answer here: Import multiple csv files into pandas and concatenate into one DataFrame
Although you asked for python in general, pandas does a great job at data I/O and would help you here in my opinion.
Up to this point I expect the output to be the names of the CSV files
This is the problem. csv.reader objects do not represent filenames. They represent lazy objects which may be iterated to yield rows from a CSV file. If you wish to print an entire CSV file, open it first and call list on the csv.reader object:
folder = "C:\\Users\\AmiteshSahay\\Desktop\\test_csv"
for name in os.listdir(folder):
    # csv.reader needs an open file object, not a filename string
    with open(os.path.join(folder, name), newline='') as f:
        print(list(csv.reader(f)))
If I add a next() call after csv.reader(), I get the output below
Yes, this is what you should expect. Calling next on an iterator will give you the next value which comes out of that iterator. This would be the first line of each file. For example:
from io import StringIO
import csv

some_file = StringIO("""1
2
3""")
with some_file as fin:
    reader = csv.reader(fin)
    print(next(reader))
['1']
which happen to be sorted; for example, the names are 1.csv, 2.csv and so on.
This is either a coincidence or a correlation between the filename and the contents of the respective file. Calling next(reader) will not output part of a filename.
Apart from this, once I have the files iterated, how do I see the contents of the CSV files on the screen?
Use the print function, as in the examples above.
Today I have 6 files. Later on, I could have 100 files. So it's not possible to use the file handling method in my scenario.
This is not true. You can define a function that prints all or part of a CSV file, then call that function in a for loop with each filename as input; a sketch follows.
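A minimal sketch of that idea (print_csv and max_rows are illustrative names, not from the question):
import csv
import os

def print_csv(path, max_rows=None):
    # print every row of one CSV file, or only the first max_rows rows
    with open(path, newline='') as f:
        for i, row in enumerate(csv.reader(f)):
            if max_rows is not None and i >= max_rows:
                break
            print(row)

folder = "C:\\Users\\AmiteshSahay\\Desktop\\test_csv"
for name in os.listdir(folder):
    print_csv(os.path.join(folder, name))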
If you want to import your files as separate dataframes, you can try this:
import pandas as pd
import os

filenames = os.listdir("../data/")  # lists all csv files in your directory

def extract_name_files(text):  # removes the .csv extension from each file name
    # os.path.splitext avoids the str.strip pitfall: strip removes a set of
    # characters from both ends, not a literal '.csv' suffix
    name_file = os.path.splitext(text)[0].lower()
    return name_file

names_of_files = list(map(extract_name_files, filenames))  # used to name your dataframes

for i in range(len(names_of_files)):  # saves each csv in a dataframe structure
    exec(names_of_files[i] + " = pd.read_csv('../data/' + filenames[i])")
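As an aside, a dictionary keyed by the cleaned name (as in the first answer on this page) avoids exec entirely; a sketch under the same assumptions:
# map each cleaned name to its dataframe instead of exec-ing variables
dataframes = {name: pd.read_csv('../data/' + fname)
              for name, fname in zip(names_of_files, filenames)}
# dataframes['somename'] then retrieves one dataframe ('somename' is hypothetical)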
You can read and store several dataframes into separate variables using two lines of code.
import pandas as pd

datasets_list = ['users', 'calls', 'messages', 'internet', 'plans']
users, calls, messages, internet, plans = [
    pd.read_csv(f'datasets/{dataset_name}.csv') for dataset_name in datasets_list]
