Manipulating multiple CSV files at once - Python

I am currently learning how to work with Python, and for the moment I am very fond of working with CSV files. I managed to learn a few things and now I want to apply what I learned to multiple files at once. But something has me confused. I have this code:
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(".csv"):
            paths = os.path.join(root, file)
            tables = pd.read_csv(paths, header='infer', sep=',')
            print(paths)
            print(tables)
It prints all the CSV files found in that folder in a certain format (a kind of table, with the first row being a header and the rest following under it).
The trick is that I want to be able to access these anytime (print and edit), but what I wrote there only prints them ONCE. If I write print(paths) or print(tables) anywhere else after that, it only prints the LAST CSV file and its data, even though I believe it should do the same thing.
I also tried making similar separate loops for each print (tables and paths), but it only works for the first os.walk() - I just don't get why it only works once.
Thank you!

You will want to store the DataFrames as you load them. Right now you read each file into the same two variables, so every iteration overwrites the previous one; when the loop finishes, paths and tables hold only the last file, which is why any later print shows only that one.
dfs = []
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(".csv"):
            paths = os.path.join(root, file)
            tables = pd.read_csv(paths, header='infer', sep=',')
            dfs.append(tables)
            print(paths)
            print(tables)
The above will give you a list of DataFrames, dfs, that you can then access and utilize. Like so:
print(dfs[0])  # prints the first DataFrame you read in
for df in dfs:
    print(df)  # prints each DataFrame in sequence
Once you have the data stored you can do pretty much anything.
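If you also want to look each table up later by its file path (to print or edit a specific one), a dict keyed by path works too. This is a minimal sketch of that idea; the path and the write-back step are just illustrative:
import os
import pandas as pd

path = '.'  # placeholder: the top-level folder to walk

tables_by_path = {}
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(".csv"):
            full_path = os.path.join(root, file)
            tables_by_path[full_path] = pd.read_csv(full_path)

# later, fetch any one table by its path, edit it, and write it back:
# df = tables_by_path[some_path]
# df.to_csv(some_path, index=False)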

Related

Finding filenames in a folder that contain a variable from a list, opening the JSON files and performing actions on them

I'm working with JSON filetypes and I've created some code that will open a single file and add it to a pandas dataframe, performing some procedures on the data within. A snippet of this code follows:
response_dic=first_response.json()
print(response_dic)
base_df=pd.DataFrame(response_dic)
base_df.head()
The code then goes on to extract parts of the JSON data into dataframes, before merging and printing to CSV.
Where I want to develop the code is to have it iterate through a folder first, find filenames that match my list of filenames that I want to work on, and then perform the functions on those files. For example, I have a folder with 1000 docs, and I will only need to perform the function on a sample of these.
I've created a list in CSV of the account codes that I want to work on, I've then imported the csv details and created a list of account codes as follows:
import csv

csv_file = open(r'C:\filepath', 'r')
cikas = []
cikbs = []
csv_file.readline()  # skip the header row
for a, b, c in csv.reader(csv_file, delimiter=','):
    cikas.append(a)
    cikbs.append(b)
midstring = [s for s in cikbs]
print(midstring)
My account names are then stored in midstring, for example ['12345', '2468', '56789']. This means I can control which account codes are worked on by amending my CSV file in future. These names will vary at different stages hence I don't want to absolutely define them at this stage.
What I would like the code to do is check the working directory, see if there is a file that matches, for example, C:\Users*12345.json. If there is, perform the pandas procedures upon it, then move on to the next file. Is this possible? I've tried a number of tutorials involving glob, iglob, fnmatch etc., but I'm struggling to come up with a workable solution.
You can list all the files with a .json extension in the target directory first.
import os, json
import pandas as pd
path_to_json = 'currentdir/'
json_files = [json_file for json_file in os.listdir(path_to_json) if json_file.endswith('.json')]
print(json_files)
Now iterate over json_files and check each name against your list:
# example: json_files = ['12345.json', '2468.json', '56789.json']
#          midstring  = ['12345', '2468', '56789']
for file in json_files:
    if file.split('.')[0] in midstring:
        with open(os.path.join(path_to_json, file)) as f:
            df = pd.DataFrame(json.load(f))
        # perform pandas functions on df here
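Since the question mentions glob, here is a rough equivalent that matches each account code directly with glob.glob; path_to_json and the pattern are placeholders to adapt:
import glob
import json
import os
import pandas as pd

for code in midstring:
    # match e.g. '12345.json'; widen the pattern if codes sit inside longer names
    for match in glob.glob(os.path.join(path_to_json, '*' + code + '.json')):
        with open(match) as f:
            df = pd.DataFrame(json.load(f))
        # perform pandas functions on df here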

python: error reading files into data frame

I'm trying to import multiple csv files in one folder into one data frame. This is my code. It can iterate through the files and print them successfully, and it can read one file into a data frame, but combining them raises an error. I saw many similar questions, but the responses are complex; I thought the 'pythonic' way was to be simple, since I am new to this. Thanks in advance for any help. The error message is always: No such file or directory: 'some file name', which makes no sense because it successfully printed the file name in the print step.
import os
import pandas as pd

# this works
df = pd.read_csv("headlines/2017-1.csv")
print(df)

path = 'C:/.../... /.../headlines/'  # full path, shortened here
files = os.listdir(path)
print(files)  # prints all file names successfully

for filename in files:
    print(filename)             # <-- successfully prints all file names
    df = pd.read_csv(filename)  # <-- error here
    df2.append(df)              # append to data frame
It seems like your current working directory is different from your path. os.listdir returns bare file names, not full paths, so pd.read_csv(filename) looks for each file relative to the working directory. Use os.chdir(path) before attempting to read your csv.
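Alternatively, you can leave the working directory alone and join the folder onto each name. A sketch along those lines (the folder path is a placeholder):
import os
import pandas as pd

path = 'headlines/'  # placeholder; use your real folder path

frames = []
for filename in os.listdir(path):
    if filename.endswith('.csv'):
        frames.append(pd.read_csv(os.path.join(path, filename)))

# combine everything into one data frame at the end
combined = pd.concat(frames, ignore_index=True)
print(combined)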

Read csv files in multiple zip files by using one csv as an example and loop

I have multiple zip files in a folder and within the zip files are multiple csv files.
Not all the csv files have all the columns, but a few do.
How can I use the file that has all the columns as an example and then loop it to extract all the data into one dataframe and save it into one csv for further use?
The code I am following right now is as below:
import glob
import zipfile
import pandas as pd

dfs = []
for zip_file in glob.glob(r"C:\Users\harsh\Desktop\Temp\*.zip"):
    zf = zipfile.ZipFile(zip_file)
    dfs += [pd.read_csv(zf.open(f), sep=";", encoding='latin1') for f in zf.namelist()]

df = pd.concat(dfs, ignore_index=True)
print(df)
However, I am not getting the columns and headers at all. I am stuck at this stage.
If you'd like to know the file structure, please find the output of the code here and the example csv file here. If you would like to see my project files for this code, please find the shared Google Drive link here.
Also, at the risk of sounding redundant: why am I required to use the sep=";", encoding='latin1' part? The code gives me an error without it.
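On the sep=";" part: many CSV exports, especially from European-locale software, separate fields with semicolons and encode text as Latin-1, so pandas has to be told both; with the defaults it either crams each row into a single column (which would explain the missing columns and headers) or fails on non-UTF-8 bytes. If you want to verify this rather than take it on faith, here is a small sketch that sniffs the delimiter of one member file (the zip path is a placeholder):
import csv
import io
import zipfile

with zipfile.ZipFile(r"C:\Users\harsh\Desktop\Temp\example.zip") as zf:
    name = zf.namelist()[0]
    with zf.open(name) as f:
        # read a small text sample to inspect
        sample = io.TextIOWrapper(f, encoding='latin1').read(2048)

dialect = csv.Sniffer().sniff(sample, delimiters=',;')
print(name, 'uses delimiter:', repr(dialect.delimiter))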

Looping through multiple CSV files in different folders, producing multiple outputs, and putting these in the same folder as the input

Problem: I have around 100 folders in 1 main folder, each with a csv file of the exact same name and structure, for example input.csv. I have written a Python script for one of the files which takes the csv file as input and produces two images as output in the same folder. I want to produce these images and put them in every folder, given the input per folder.
Is there a fast way to do this? Until now I have copied my script into each folder and executed it there. For 5 folders it was alright, but for 100 this will get tedious and take a lot of time.
Can someone please help me out? I'm very new to coding w.r.t. directories, paths, files etc. I have already tried to look for a solution, but no success so far.
You could try something like this:
import os
import pandas as pd

path = r'path\to\folders'
filename = 'input'

# get all directories within the top-level directory
dirs = [os.path.join(path, val) for val in os.listdir(path)
        if os.path.isdir(os.path.join(path, val))]

# loop through each directory
for dd in dirs:
    file_ = [f for f in os.listdir(dd) if f.endswith('.csv') and filename in f]
    if file_:
        # this assumes only a single matching .csv exists in each directory
        file_ = file_[0]
    else:
        continue
    data = pd.read_csv(os.path.join(dd, file_))
    # apply your script here and save the images back to folder dd
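As a sketch of that last step, assuming the per-folder script produces matplotlib figures, saving them back into each folder could look like this (the plots and output file names are placeholders for whatever your script actually draws):
import os
import matplotlib.pyplot as plt

def make_outputs(data, out_dir):
    # placeholder figure 1: line plot of the csv contents
    fig1, ax1 = plt.subplots()
    data.plot(ax=ax1)
    fig1.savefig(os.path.join(out_dir, 'output_1.png'))
    plt.close(fig1)

    # placeholder figure 2: histogram of the csv contents
    fig2, ax2 = plt.subplots()
    data.plot.hist(ax=ax2)
    fig2.savefig(os.path.join(out_dir, 'output_2.png'))
    plt.close(fig2)

# inside the loop above: make_outputs(data, dd)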

Python: get the names of specific file types in a given folder

I have a folder which contains n csv files, and they are all named event_1.csv, event_2.csv, all the way up to event_n.csv.
I want to be able to raw_input their folder path and then retrieve their names, file extension excluded.
I tried to do it, but all I get is the folder name printed as many times as there are csv files in it. The filenames, which are what I want, are never printed.
This is the best I have come up with so far, but it does the wrong thing:
import os
directoryPath = raw_input('Directory for csv files: ')
for i, file in enumerate(os.listdir(directoryPath)):
    if file.endswith(".csv"):
        print os.path.basename(directoryPath)
where directoryPath=C:\Users\MyName\Desktop\myfolder, in which there are three files: event_1.csv, event_2.csv, event_3.csv.
What I get is:
myfolder
myfolder
myfolder
What I want, instead, is:
event_1
event_2
event_3
PS: I expect to have as many as 100 files, so I should be able to handle situations like these, where the filename is longer:
event_10
event_100
EDIT
What can I add to make sure the files are read exactly in the same order as they are named? This means: first read event_1.csv, then event_2.csv, and so forth until I reach event_100.csv. Thanks!
It seems like you are getting what you ask for. See this line in your code:
print os.path.basename(directoryPath)
It prints the directoryPath.
I think it should be:
import os
directoryPath = raw_input('Directory for csv files: ')
for i, file in enumerate(os.listdir(directoryPath)):
    if file.endswith(".csv"):
        print os.path.splitext(file)[0]  # file name without the .csv extension
Good luck!
EDIT:
Let's create a list of all the file names, without path and extension (l). Now:
l = [os.path.splitext(f)[0] for f in os.listdir(directoryPath) if f.endswith('.csv')]
for n in sorted(l, key=lambda x: int(x.split('_')[1])):
    print n
Now you need to write your specific solution :)
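For illustration, the int(...) key is what avoids plain lexicographic order, where 'event_10' would sort before 'event_2':
names = ['event_1', 'event_10', 'event_2']
print(sorted(names))
# ['event_1', 'event_10', 'event_2']  <- lexicographic: '10' compares before '2'
print(sorted(names, key=lambda x: int(x.split('_')[1])))
# ['event_1', 'event_2', 'event_10']  <- natural numeric order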
