Python: Parsing multiple csv files and skip files without a keyword - python

I am trying to read some .csv field data on python for post-processing, I typically just use something like:
for flist in glob('*.csv'):
df = pd.read_csv(flist, delimiter = ',')
However I need to filter through the bad files which contain "Run_Terminated" somewhere in the file and skip the file entirely. I'm still new to python so I'm not familiar with all of its functionalities, any input would be appreciated. Thank you.

What you could do is first read the file fully in memory (using a io.StringIO file-like object and look for the Run_Terminated string anywhere in the file (dirty, but should be OK),
Then pass the handle to read_csv (since you can pass a handle OR a filename) so you don't have to read it again from the file.
import pandas as pd
import glob
import io
for flist in glob('*.csv'):
with open(flist) as f:
data = io.StringIO()
data.write(f.read())
if "Run_Terminated" not in data.getvalue():
data.seek(0) # rewind or it won't read anything
df = pd.read_csv(data, delimiter = ',')

Related

Import pipe delimited txt file into spark dataframe in databricks

I have a data file saved as .txt format which has a header row at the top, and is pipe delimited. I am working in databricks, and am needing to create a spark dataframe of this data, with all columns read in as StringType(), the headers defined by the first row, and the columns separated based on the pipe delimiter.
When importing .csv files I am able to set the delimiter and header options. However, I am not able to get the .txt files to import in the same way.
Example Data (completely made up)... for ease, please imagine it is just called datafile.txt:
URN|Name|Supported
12233345757777701|Tori|Yes
32313185648456414|Dave|No
46852554443544854|Steph|No
I would really appreciate a hand in getting this imported into a Spark dataframe so that I can crack on with other parts of the analysis. Thank you!
Any delimiter separated file is a good candidate for csv reading methods. The 'c' of csv is mostly by convention. Thus nothing stops us from reading this:
col1|col2|col3
0|1|2
1|3|8
Like this (in pure python):
import csv
from pathlib import Path
with Path("pipefile.txt").open() as f:
reader = csv.DictReader(f, delimiter="|")
data = list(reader)
print(data)
Since whatever custom reader your libraries are using probably uses csv.reader under the hood you simply need to figure out how to pass the right separator to it.
#blackbishop notes in a comment that
spark.read.csv("datafile.text", header=True, sep="|")
would be the appropriate spark call.

Pandas - save multiple CSV in a zip archive

I need to save multiple dataframes in CSV, all in a same zip file.
Is it possible without making temporary files?
I tried using zipfile:
with zipfile.ZipFile("archive.zip", "w") as zf:
with zf.open(f"file1.csv", "w") as buffer:
data_frame.to_csv(buffer, mode="wb")
This works with to_excel but fails with to_csv as as zipfiles expects binary data and to_csv writes a string, despite the mode="wb" parameter:
.../lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 283, in _save_header
writer.writerow(encoded_labels)
.../lib/python3.8/zipfile.py", line 1137, in write
TypeError: a bytes-like object is required, not 'str'
On the other hand, I tried using the compression parameter of to_csv, but the archive is overwritten, and only the last dataframe remains in the final archive.
If no other way, I'll use temporary files, but I was wondering if someone have an idea to allow to_csv and zipfile work together.
Thanks in advance!
I would approach this following way
import io
import pandas as pd
df = pd.DataFrame({"x":[1,2,3]})
string_io = io.StringIO()
df.to_csv(string_io)
string_io.seek(0)
df_bytes = string_io.read().encode('utf-8')
as df_bytes is bytes it should now work with zipfile. Edit: after looking into to_csv help I found simpler way, to get bytes namely:
import pandas as pd
df = pd.DataFrame({"x":[1,2,3]})
df_bytes = df.to_csv().encode('utf-8')
For saving multiple excel files from dataframe in a zip file
import zipfile
from zipfile import ZipFile
import pandas as pd
df1 = pd.DataFrame({"x":[1,2,3]})
df2 = pd.DataFrame({"y":[4,5,6]})
df3 = pd.DataFrame({"z":[7,8,9]})
with zipfile.ZipFile("rishabh.zip", "w") as zf:
with zf.open(f"check1.xlsx", "w") as buffer:
df1.to_excel(buffer,index=False)
with zf.open(f"check2.xlsx", "w") as buffer:
df2.to_excel(buffer,index=False)
with zf.open(f"check3.xlsx", "w") as buffer:
df3.to_excel(buffer, index=False)

Grab values from seperate csv file and replace the values of columns in a pipe delimited file

Trying to whip this out in python. Long story short I got a csv file that contains column data i need to inject into another file that is pipe delimited. My understanding is that python can't replace values, so i have to re-write the whole file with the new values.
data file(csv):
value1,value2,iwantthisvalue3
source file(txt, | delimited)
value1|value2|iwanttoreplacethisvalue3|value4|value5|etc
fixed file(txt, | delimited)
samevalue1|samevalue2| replacedvalue3|value4|value5|etc
I can't figure out how to accomplish this. This is my latest attempt(broken code):
import re
import csv
result = []
row = []
with open("C:\data\generatedfixed.csv","r") as data_file:
for line in data_file:
fields = line.split(',')
result.append(fields[2])
with open("C:\data\data.txt","r") as source_file, with open("C:\data\data_fixed.txt", "w") as fixed_file:
for line in source_file:
fields = line.split('|')
n=0
for value in result:
fields[2] = result[n]
n=n+1
row.append(line)
for value in row
fixed_file.write(row)
I would highly suggest you use the pandas package here, it makes handling tabular data very easy and it would help you a lot in this case. Once you have installed pandas import it with:
import pandas as pd
To read the files simply use:
data_file = pd.read_csv("C:\data\generatedfixed.csv")
source_file = pd.read_csv('C:\data\data.txt', delimiter = "|")
and after that manipulating these two files is easy, I'm not exactly sure how many values or which ones you want to replace, but if the length of both "iwantthisvalue3" and "iwanttoreplacethisvalue3" is the same then this should do the trick:
source_file['iwanttoreplacethisvalue3'] = data_file['iwantthisvalue3]
now all you need to do is save the dataframe (the table that we just updated) into a file, since you want to save it to a .txt file with "|" as the delimiter this is the line to do that (however you can customize how to save it in a lot of ways):
source_file.to_csv("C:\data\data_fixed.txt", sep='|', index=False)
Let me know if everything works and this helped you. I would also encourage to read up (or watch some videos) on pandas if you're planning to work with tabular data, it is an awesome library with great documentation and functionality.

Extract a particular value of csv file without uploading whole file

So I have a several tables in the format of csv, I am using Python and the csv module. I want to extract a particular value, lets say column=80 row=109.
Here is a random example:
import csv
with open('hugetable.csv', 'r') as file:
reader = csv.reader(file)
print(reader[109][80])
I am doing this many times with large tables and I would like to avoid loading the whole table into an array (line 2 above) to ask for a single value. Is there a way to open the file, load the specific value and close it again? Would this process be more efficient than what I have done above?
Thanks for all the answers, all answers so far work pretty well.
You could try reading the file without csv library:
row = 108
column = 80
with open('hugetable.csv', 'r') as file:
header = next(file)
for _ in range(row-1):
_ = next(file)
line = next(file)
print(line.strip().split(',')[column])
You can try pandas to load only certain columns of your csv file
import pandas as pd
pd.read_csv('foo.csv',usecols=["column1", "column2"])
You could use pandas to load it
import pandas as pd
text = pd.read_csv('Book1.csv', sep=',', header=None, skiprows= 100, nrows=3)
print(text[50])

Reading CSV files from zip archive with python-3.x

I have a zipped archive that contains several csv files.
For instance, assume myarchive.zip contains myfile1.csv, myfile2.csv, myfile3.csv
In python 2.7 I was able to load iteratively all the myfiles in pandas using
import pandas as pd
import zipfile
with zipfile.ZipFile(myarchive.zip, 'r') as zippedyear:
for filename in ['myfile1.csv', 'myfile2.csv', 'myfile3.csv']:
mydf = pd.read_csv(zippedyear.open(filename))
Now doing the same thing with Python 3 throws the error
ParserError: iterator should return strings, not bytes (did you open
the file in text mode?)
I am at a loss here. Any idea what is the issue?
Thanks!
Strange indeed, since the only mode you can specify is r/w (character modes).
Here's a workaround; read the file using file.read, load the data into a StringIO buffer, and pass that to read_csv.
from io import StringIO
with zipfile.ZipFile(myarchive.zip, 'r') as zippedyear:
for filename in ['myfile1.csv', 'myfile2.csv', 'myfile3.csv']:
with zippedyear.open(filename) as f:
mydf = pd.read_csv(io.StringIO(f.read()))

Categories

Resources