I want to delete specific rows (rows 0 to 33) from every single .csv file in my directory, but I have 224 separate .csv files that need this done. I would appreciate help with how I can do this with one piece of code.
I think you can use glob and pandas to do this quite easily. I'm not sure if you want to overwrite your original files (something I never recommend), so be careful, as this code will do that.
import os
import glob
import pandas as pd

os.chdir(r'yourdir')
allFiles = glob.glob("*.csv")  # match your csvs

for file in allFiles:
    df = pd.read_csv(file)
    df = df.iloc[34:]  # drop rows 0-33 (the first 34 rows) and keep the rest
    df.to_csv(file, index=False)  # overwrites the original file
    print(f"rows 0-33 have been removed from {file}")
or something along those lines..
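Alternatively, if you'd rather not load the rows you're discarding at all, pandas can skip them at read time. This is just a sketch, assuming each file has a header row on its first line (adjust skiprows if not); the cleaned folder is a placeholder so the originals stay untouched:
import os
import glob
import pandas as pd

os.chdir(r'yourdir')
os.makedirs('cleaned', exist_ok=True)  # hypothetical output folder, so originals are not overwritten

for file in glob.glob("*.csv"):
    # keep the header (file line 0) and skip the next 34 lines, i.e. data rows 0-33
    df = pd.read_csv(file, skiprows=range(1, 35))
    df.to_csv(os.path.join('cleaned', file), index=False)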
This is a simple combination of two separate tasks.
First, you need to loop through all the csv files in a folder. See this StackOverflow answer for how to do that.
Next, within that loop, for each file, you need to modify the csv by removing rows. See this answer for how to read a csv, write a csv, and omit certain rows based on a condition.
One final aspect is that you want to omit certain line numbers. A good way to do this is with the enumerate function.
So code such as this will give you the line numbers.
import csv

with open('first.csv', 'r', newline='') as infile, \
     open('first_edit.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for i, row in enumerate(csv.reader(infile)):
        if i > 33:  # keep everything after rows 0-33
            writer.writerow(row)
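To apply that to all 224 files in one go, you can wrap the same logic in a loop over the folder's csv files. A rough sketch, where the folder path and the _edit suffix are just placeholders:
import csv
import glob
import os

folder = r'yourdir'  # placeholder for your directory

for path in glob.glob(os.path.join(folder, '*.csv')):
    root, ext = os.path.splitext(path)
    with open(path, 'r', newline='') as infile, \
         open(root + '_edit' + ext, 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        for i, row in enumerate(csv.reader(infile)):
            if i > 33:  # skip rows 0-33
                writer.writerow(row)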
Iterate over the CSV files, use Pandas to remove the top 34 rows of each file, then save it to an output directory.
Try this code after installing pandas:
from pathlib import Path
import pandas as pd

source_dir = Path('path/to/source/directory')
output_dir = Path('path/to/output/directory')

for file in source_dir.glob('*.csv'):
    df = pd.read_csv(file)
    df.drop(df.head(34).index, inplace=True)
    df.to_csv(output_dir.joinpath(file.name), index=False)
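One small caveat: to_csv will fail if the output directory does not exist yet, so you may want to create it first. A one-line addition, assuming the same output_dir as above:
output_dir.mkdir(parents=True, exist_ok=True)  # create the output directory if it is missing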
#Purpose: read every csv file in the directory, filter the rows whose 'result' column says 'fail', then copy those rows into a new CSV file.

# import necessary libraries
from sqlite3 import Row
import pandas as pd
import os
import glob
import csv

# the path to your csv file directory.
mycsvdir = 'C:\\Users\\'  # this is where all the csv files will be housed.

# use glob to get all the csv files in that directory (assuming they have the extension .csv)
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))

# loop through the files and read them in with pandas
dataframes = []  # a list to hold all the individual pandas DataFrames
for csvfile in csv_files:
    # read the csv file
    df = pd.read_csv(csvfile)
    dataframes.append(df)

#print(row['roi_id'], row['result'])  # roi_id is the column label for the first cell in the csv, result is the Jth column label
dataframes = dataframes[dataframes['result'].str.contains('fail')]

# print out to a new csv file
dataframes.to_csv('ROI_Fail.csv')  # rewrite this to mirror the variable you want to save the failed rows in.
I tried running this script but I'm getting a couple of errors. First off, I know my indentation is off (newbie over here), and I'm getting a big error under my for loop saying that "csv_files" is not defined. Any help would be greatly appreciated.
There are two issues here:
The first one is kind of easy: the variable in the for loop should be csvfiles, not csv_files.
The second one (which will show up once you fix the one above) is that you are treating a list of dataframes as a dataframe.
The object "dataframes" in your script is a list to which you are appending the dataframes created from the CSV files. As such, you cannot index them by the column name as you are trying to do.
If your dataframes have the same layout, I'd recommend using pd.concat to join all the dataframes into a single one, and then filtering the rows as you did here:
full_dataframe = pd.concat(dataframes, axis=0)
full_dataframe = full_dataframe[full_dataframe['result'].str.contains('fail')]
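Putting both fixes together, the corrected part of the script might look something like this (just a sketch, reusing the csvfiles list and the 'result' column from your code):
import glob
import os
import pandas as pd

mycsvdir = 'C:\\Users\\'  # same directory variable as in your script
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))

# read every csv into a list of DataFrames
dataframes = [pd.read_csv(csvfile) for csvfile in csvfiles]

# combine them, keep only the rows whose 'result' column contains 'fail', and save
full_dataframe = pd.concat(dataframes, axis=0, ignore_index=True)
failed = full_dataframe[full_dataframe['result'].str.contains('fail')]
failed.to_csv('ROI_Fail.csv', index=False)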
As a tip for further posts I'd recommend you also post the full traceback from your program. It helps us understand exactly what error you had when executing your code.
So I have several tables in csv format, and I am using Python and the csv module. I want to extract a particular value, let's say column=80, row=109.
Here is a random example:
import csv

with open('hugetable.csv', 'r') as file:
    rows = list(csv.reader(file))  # loads the whole table into memory
    print(rows[109][80])
I am doing this many times with large tables, and I would like to avoid loading the whole table into a list (the list(...) call above) just to look up a single value. Is there a way to open the file, read only the specific value, and close it again? Would that be more efficient than what I have done above?
Thanks for all the answers; all of them so far work pretty well.
You could try reading the file without the csv library:
row = 108
column = 80
with open('hugetable.csv', 'r') as file:
    header = next(file)            # skip the header line
    for _ in range(row - 1):       # skip the lines before the one you want
        _ = next(file)
    line = next(file)
    print(line.strip().split(',')[column])
You can try pandas to load only certain columns of your csv file:
import pandas as pd
pd.read_csv('foo.csv',usecols=["column1", "column2"])
You could use pandas to load just the slice of the file that you need:
import pandas as pd
text = pd.read_csv('Book1.csv', sep=',', header=None, skiprows=100, nrows=3)
print(text[50])
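And if you only need the one cell, you can combine the row and column slicing in a single call. A sketch using the example coordinates from the question (it assumes the file has no header row; shift the offsets if it does):
import pandas as pd

# read only line 109 and only column 80 of the file
cell = pd.read_csv('hugetable.csv', header=None, skiprows=109, nrows=1, usecols=[80])
print(cell.iloc[0, 0])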
I have many txt files (which have been converted from pdf) in a folder. I want to create a csv/excel dataset where each text file becomes a row. Right now I am reading the files into a pandas dataframe and then trying to save it to a csv file. When I print the dataframe, I get one row per txt file. However, when saving to a csv file, the texts get broken into multiple rows/lines for each txt file rather than staying on just one row. Do you know how I can solve this problem? Any help would be highly appreciated. Thank you.
Following is the code I am using now.
import glob
import os
import pandas as pd

file_list = glob.glob(os.path.join(os.getcwd(), "K:\\text_all", "*.txt"))
corpus = []
for file_path in file_list:
    with open(file_path, encoding="latin-1") as f_input:
        corpus.append(f_input.read())

df = pd.DataFrame({'col': corpus})
print(df)
df.to_csv('K:\\out.csv')
Update
If this is not possible, it would also help to transform the data a bit in the pandas dataframe. I want to add a column with the names of the txt files, that is, the name of each txt file in the folder becomes the identifier of the respective text. I will then save it in tsv format so that the lines do not get split on commas, as suggested by someone here.
I need something like following.
identifier col
txt1 example text in this file
txt2 second example text in this file
...
txtn final example text in this file
Use
import csv
df.to_csv('K:\\out.csv', quoting=csv.QUOTE_ALL)
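To address the update (an identifier column built from the file names, saved as TSV), here is a minimal sketch along the lines of your existing loop; the folder path is the one from your question and the output file name is just a placeholder:
import glob
import os
import pandas as pd

file_list = glob.glob(os.path.join("K:\\text_all", "*.txt"))

records = []
for file_path in file_list:
    with open(file_path, encoding="latin-1") as f_input:
        # use the file name without its extension as the identifier
        identifier = os.path.splitext(os.path.basename(file_path))[0]
        records.append({'identifier': identifier, 'col': f_input.read()})

df = pd.DataFrame(records, columns=['identifier', 'col'])
# tab-separated output; fields containing newlines are still quoted by default
df.to_csv('K:\\out.tsv', sep='\t', index=False)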
I have six .csv files. Their overall size is approximately 4 GB. I need to clean each one and do some data analysis on it. These operations are the same for all the frames.
This is my code for reading them.
#df = pd.read_csv(r"yellow_tripdata_2018-01.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-02.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-03.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-04.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-05.csv")
df = pd.read_csv(r"yellow_tripdata_2018-06.csv")
Each time I run the kernel, I activate one of the files to be read.
I am looking for a more elegant way to do this. I thought about writing a for-loop, making a list of file names and then reading them one after the other, but I don't want to merge them together, so I think another approach must exist. I have been searching for it, but it seems all the questions lead to concatenating the files at the end.
Use for and format like this. I use this every single day:
import pandas as pd

number_of_files = 6
for i in range(1, number_of_files + 1):
    df = pd.read_csv("yellow_tripdata_2018-0{}.csv".format(i))
    # your code here: do the analysis, then the loop will move on and read the next dataframe
You could use a list to hold all of the dataframes:
number_of_files = 6
dfs = []
for file_num in range(1, number_of_files + 1):
    # I use Python 3.6, so I'm used to f-strings now. If you're on Python <3.6, use .format()
    dfs.append(pd.read_csv(f"yellow_tripdata_2018-0{file_num}.csv"))
Then to get a certain dataframe use:
df1 = dfs[0]
Edit:
As you are trying to avoid loading all of these into memory at once, I'd resort to streaming them. Try changing the for loop to something like this:
import csv

for file_num in range(1, number_of_files + 1):
    # keep the file handle open (no with-block) so the reader stays usable afterwards
    f = open(f"yellow_tripdata_2018-0{file_num}.csv", 'r', newline='')
    dfs.append(csv.reader(f))
Then just use a for loop over dfs[n] or next(dfs[n]) to read each line into memory (and close the file handles once you are done with them).
P.S.
You may need multi-threading to iterate through each one at the same time.
Loading/Editing/Saving - using the csv module
OK, so I've done a lot of research: Python's csv module does load one line at a time, which mostly comes down to the mode we open the file in (explained here).
If you don't want to use Pandas (chunking may honestly be the answer; just work that into @seralouk's answer if so), then yes, I think the approach below would be the best one; we just need to change a couple of things.
import csv

number_of_files = 6
filename = "yellow_tripdata_2018-{}.csv"

for file_num in range(1, number_of_files + 1):
    # notice I'm opening the original file as f in mode 'r' for read only
    # and the new file as nf in mode 'a' for append
    with open(filename.format(str(file_num).zfill(2)), 'r', newline='') as f, \
         open(filename.format(str(file_num).zfill(2) + "-new"), 'a', newline='') as nf:
        # initialize the writer before looping over every line
        w = csv.writer(nf)
        for row in csv.reader(f):
            # do your "data cleaning" here (THIS IS PER-LINE, REMEMBER)
            # then save to file
            w.writerow(row)
Note:
You may want to consider using a DictReader and/or DictWriter; I'd prefer them over the regular reader/writer as I find them easier to understand.
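For what that would look like, here is a rough sketch of one file handled with DictReader/DictWriter (the file names follow the pattern above; the column access in the comment is only illustrative):
import csv

with open("yellow_tripdata_2018-01.csv", 'r', newline='') as f, \
     open("yellow_tripdata_2018-01-new.csv", 'w', newline='') as nf:
    reader = csv.DictReader(f)
    writer = csv.DictWriter(nf, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # row is a dict keyed by column name, e.g. row["some_column"]
        writer.writerow(row)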
Pandas Approach - using chunks
PLEASE READ this answer - if you'd like to steer away from my csv approach and stick with Pandas :) It literally seems like it's the same issue as yours and the answer is what you're asking for.
Basically, Pandas allows you to partially load a file in chunks, execute any alterations, and then write those chunks out to a new file. Below is mostly from that answer, but I did do some more reading myself in the docs.
import pandas as pd

number_of_files = 6
chunksize = 500  # find the chunksize that works best for you
filename = "yellow_tripdata_2018-{}.csv"

for file_num in range(1, number_of_files + 1):
    for chunk in pd.read_csv(filename.format(str(file_num).zfill(2)), chunksize=chunksize):
        # do your data cleaning here
        # see, again we're writing in append mode, so the new file is built up chunk by chunk
        chunk.to_csv(filename.format(str(file_num).zfill(2) + "-new"), mode='a')
For more info on chunking the data see here as well it's good reading for those such as yourself getting headaches over these memory issues.
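Another lightweight option, if the goal is simply to keep only one file in memory at a time while running the same cleaning code on each, is a small generator. A sketch, assuming the same file-name pattern as above (tripdata_frames is just a hypothetical helper name):
import pandas as pd

def tripdata_frames(number_of_files=6):
    """Yield one DataFrame at a time so only one file is loaded at once."""
    for i in range(1, number_of_files + 1):
        yield pd.read_csv(f"yellow_tripdata_2018-{i:02d}.csv")

for df in tripdata_frames():
    # do your cleaning / analysis on df here; the previous DataFrame is
    # released when df is rebound on the next iteration
    ...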
Use glob.glob to get all files with similar names:
import glob
import pandas as pd

files = glob.glob("yellow_tripdata_2018-0?.csv")
for f in files:
    df = pd.read_csv(f)
    # manipulate df
    df.to_csv(f)
This will match yellow_tripdata_2018-0<any one character>.csv. You can also use yellow_tripdata_2018-0*.csv to match yellow_tripdata_2018-0<anything>.csv or even yellow_tripdata_*.csv to match all csv files that start with yellow_tripdata.
Note that this also only loads one file at a time.
Use os.listdir() to make a list of files you can loop through?
samplefiles = os.listdir(filepath)
for filename in samplefiles:
    df = pd.read_csv(os.path.join(filepath, filename))  # listdir returns bare names, so join them with the directory
where filepath is the directory containing multiple csv's?
Or a loop that changes the filename:
for i in range(1, 7):
    df = pd.read_csv(r"yellow_tripdata_2018-0%s.csv" % str(i))
# import libraries
import pandas as pd
import glob

# store the folder path in a variable
project_folder = r"C:\file_path"

# save all csv file paths in a variable
all_files_paths = glob.glob(project_folder + "/*.csv")

# use a list comprehension to iterate over all files and read each one into a list of DataFrames
li = [pd.read_csv(filename, index_col=None, header=0) for filename in all_files_paths]

# concatenate the list into a single pandas dataframe
df = pd.concat(li, axis=0, ignore_index=True)
I have a set of csv files I need to import into a pandas dataframe.
I have imported the filepaths as a list, FP, and I am using the following code to read the data:
for i in FP:
    df = pd.read_csv(i, index_col=None, header=0).append(df)
This is working great, but unfortunately there are no datetimestamps or file identifying attributes in the files. I need to know which file each record came from.
I tried adding this line, but this just returned the filename of the final file read:
for i in FP:
    df = pd.read_csv(i, index_col=None, header=0).append(df)
    df['filename'] = i
I can imagine some messy multi-step solutions, but wondered if there was something more elegant I could do within my existing loop.
I'd do it this way:
df = pd.concat([pd.read_csv(f, header=None).assign(filename=f) for f in FP],
               ignore_index=True)
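If you'd rather stay within your existing loop, the reason your attempt only kept the last filename is that df['filename'] = i tags the whole combined frame on every pass; tagging each file's rows before combining fixes that. A sketch, keeping your FP list and the header=0 / index_col=None options:
import pandas as pd

df = pd.DataFrame()
for i in FP:
    new = pd.read_csv(i, index_col=None, header=0)
    new['filename'] = i                      # tag this file's rows before combining
    df = pd.concat([new, df], ignore_index=True)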