I have a folder with hundreds of csv files with 9 values from a temperature sensor in it. The columns are sensor_id, lat, lon (for the coordinates) and some other stuff that i don't need. The columns that i need are just the three [timestamp, temperature and humidity].
I already tried to use a module to import just the columns that i want and
i tried to delete the columns that i dont want with loops.
slowly i despair, can someone help me pls?
If you are open to use Pandas, you can do it simply by using usecols parameter, while reading the csv file.
df = pd.read_csv('your_file/path/file.csv', usecols=['col1', 'col2'])
print(df.shape)
df.head()
Here's some code that should do it, just add in your target directory and change the numbers on the last line to the index of the column you want (with the first column being 0):
import os
import csv
targetdir = "" # fill this in
allrows = []
files = os.listdir(targetDir)
for file in files:
with open('innovators.csv', 'r') as file:
reader = csv.reader(file)
for row in reader:
allrows.append([row[1], row[3], row[5]])
Related
So I'm trying to add a header to a csv file dynamically. My current code looks like the following:
import csv
from datetime import datetime
import pandas as pd
rows = []
with open(r'Test_Timestamp.csv', 'r', newline='') as file:
with open(r'Test_Timestamp_Result.csv', 'w', newline='') as file2:
reader = csv.reader(file, delimiter=',')
for row in reader:
rows.append(row)
file_write = csv.writer(file2)
for val in rows:
current_date_time = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
val.insert(0, current_date_time)
file_write.writerow(val)
Currently how this works is it inserts a new timestamp at column A which is exactly what I want it to do, as I want everything to be pushed as I'll be working with csv files with various different number of columns.
What I'm having trouble with is, how am I able to add a column header? Currently a timestamp is created next to the header. I would want to create a new header named: Execution_Date
I have looked at pandas as a solution but from the documentation I've seen the examples given looks like its a set of column headers already pre-determined. I've tried inserting a column header with df.insert(0, "Execution_Date", current_date_time) but gives me an error when trying to accomplish this.
I know I'm fairly close to doing this but I'm running into errors. Is there a way to do this dynamically so it automatically does this with various different csv files and number of different columns in each csv file, etc.? The current output looks like:
What I want the final result to look like is:
Any help with this would be greatly appreciated! I'm going to continue to see if I can solve this in the meantime, but I'm at a wall with how to proceed.
If the end result is something that excel can read like maybe a csv you can likely bypass pandas altogether:
Edit: adding support for existing titles
Given a simple csv like:
Title,Other
Geeks1,foo
Geeks2,bar
Then you might use:
import contextlib
import csv
from datetime import datetime
with contextlib.ExitStack() as stack:
file_in = open('Test_Timestamp.csv', "r", encoding="utf-8")
file_out = open('Test_Timestamp_Result.csv', "w", encoding="utf-8", newline="")
reader = csv.reader(file_in, delimiter=',')
writer = csv.writer(file_out)
writer.writerow(["Execution_Date"] + next(reader))
writer.writerows(
[datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)] + row
for row in reader
)
to give you a file like:
Execution_Date,Title,Other
2022-02-11 00:00:00,Geeks1,foo
2022-02-11 00:00:00,Geeks2,bar
One way to do this is to utilize to_csv().
Example:
# importing python package
import pandas as pd
# read contents of csv file
file = pd.read_csv("gfg.csv")
print("\nOriginal file:")
print(file)
# adding header
headerList = ['id', 'name', 'profession']
# converting data frame to csv
file.to_csv("gfg2.csv", header=headerList, index=False)
# display modified csv file
file2 = pd.read_csv("gfg2.csv")
print('\nModified file:')enter code here
print(file2)
Good evening,
I'm having a problem with a code I'm writing, and I would love to get advice. I want to do the following:
Remove rows in a .csv file that contain a specific value (-3.4028*10^38)
Write a new .csv
The file I'm working with is large (12.2 GB, 87 million rows), and has 6 columns within it, with the first 5 columns being numerical values, and the last value containing text.
Here is my code:
import csv
directory = "/media/gman/Folder1/processed/test_removal1.csv"
with open('run1.csv', 'r') as fin, open(directory, 'w', newline='') as fout:
# define reader and writer objects
reader = csv.reader(fin, skipinitialspace=False)
writer = csv.writer(fout, delimiter=',')
# write headers
writer.writerow(next(reader))
# iterate and write rows based on condition
for i in reader:
if (i[-1]) == -3.4028E38:
writer.writerow(i)
When I run this I get the following error message:
Error: line contains NUL
File "/media/gman/Aerospace_Classes/Programs/csv_remove.py", line 19, in <module>
for i in reader: Error: line contains NUL
I'm not sure how to proceed. If anyone has any suggestions, please let me know. Thank you.
I figured it out. Here is what I ended up doing:
#IMPORT LIBRARIES
import pandas as pd
#IMPORT FILE PATH
directory = '/media/gman/Grant/Maps/processed_maps/csv_combined.csv'
#CREATE DATAFRAME FROM IMPORTED CSV
data = pd.read_csv(directory)
data.head()
data.drop(data[data.iloc[:,2] < -100000].index, inplace=True) #remove rows that contain altitude values greater than -100,000 meters.
# this is to remove the -3.402823E038 meter altitude values that keep coming up.
#CONVERT PROCESSED DATAFRAME INTO NEW CSV FILE
df = data.to_csv(r'/media/gman/Grant/Maps/processed_maps/corrected_altitude_data.csv') #export good data to this file.
I went with pandas to remove rows based on a logic argument, this made a dataframe. I then exported the dataframe into a csv file.
I'm new in Python language and i'm facing a small challenge in which i havent been able to figure it out so far.
I receive a csv file with around 30-40 columns and 5-50 rows with various details in each cell. The 1st row of the csv has the title for each column and by the 2nd row i have item values.
What i want to do is to create a python script which will read the csv file and every time to do the following:
Add a row after the actual 1st item row, (literally after the 2nd row, cause the 1st row is titles), and in that new 3rd row to contain the same information like the above one with one difference only. in the column "item_subtotal" i want to add the value from the column "discount total".
all the bellow rows should remain as they are, and save this modified csv as a new file with the word "edited" added in the file name.
I could really use some help because so far i've only managed to open the csv file with a python script im developing, but im not able so far to add the contents of the above row to that newly created row and replace that specific value.
Looking forward any help.
Thank you
Here Im attaching the CSV with some values changed for privacy reasons.
order_id,order_number,date,status,shipping_total,shipping_tax_total,fee_total,fee_tax_total,tax_total,discount_total,order_total,refunded_total,order_currency,payment_method,shipping_method,customer_id,billing_first_name,billing_last_name,billing_company,billing_email,billing_phone,billing_address_1,billing_address_2,billing_postcode,billing_city,billing_state,billing_country,shipping_first_name,shipping_last_name,shipping_address_1,shipping_address_2,shipping_postcode,shipping_city,shipping_state,shipping_country,shipping_company,customer_note,item_id,item_product_id,item_name,item_sku,item_quantity,item_subtotal,item_subtotal_tax,item_total,item_total_tax,item_refunded,item_refunded_qty,item_meta,shipping_items,fee_items,tax_items,coupon_items,order_notes,download_permissions_granted,admin_custom_order_field:customer_type_5
15001_TEST_2,,"2017-10-09 18:53:12",processing,0,0.00,0.00,0.00,5.36,7.06,33.60,0.00,EUR,PayoneCw_PayPal,"0,00",0,name,surname,,name.surname#gmail.com,0123456789,"address 1",,41541_TEST,location,,DE,name,surname,address,01245212,14521,location,,DE,,,1328,302,"product title",103,1,35.29,6.71,28.24,5.36,0.00,0,,"id:1329|method_id:free_shipping:3|method_title:0,00|total:0.00",,id:1330|rate_id:1|code:DE-MWST-1|title:MwSt|total:5.36|compound:,"id:1331|code:#getgreengent|amount:7.06|description:Launchcoupon for friends","text string",1,
You can also use pandas to manipulate the data from the csv like this:
import pandas
import copy
Read the csv file into a pandas dataframe:
df = pandas.read_csv(filename)
Make a deepcopy of the first row of data and add the discount total to the item subtotal:
new_row = copy.deepcopy(df.loc[1])
new_row['item_subtotal'] += new_row['discount total']
Concatenate the first 2 rows with the new row and then everything after that:
df = pandas.concat([df.loc[:1], new_row, df.loc[2:]], ignore_index=True)
Change the filename and write the out the new csv file:
filename = filename.strip('.csv') + 'edited.csv'
df.to_csv(filename)
I hope this helps! Pandas is great for cleanly handling massive amounts of data, but may be overkill for what you are trying to do. Then again, maybe not. It would help to see an example data file.
The first step is to turn that .csv into something that is a little easier to work with. Fortunately, python has the 'csv' module which makes it easy to turn your .csv file into a much nicer list of lists. The below will give you a way to both turn your .csv into a list of lists and turn the modified data back into a .csv file.
import csv
import copy
def csv2list(ifile):
"""
ifile = the path of the csv to be converted into a list of lists
"""
f = open(ifile,'rb')
olist=[]
c = csv.reader(f, dialect='excel')
for line in c:
olist.append(line) #and update the outer array
f.close
return olist
#------------------------------------------------------------------------------
def list2csv(ilist,ofile):
"""
ilist = the list of lists to be converted
ofile = the output path for your csv file
"""
with open(ofile, 'wb') as csvfile:
csvwriter = csv.writer(csvfile, delimiter=',',
quotechar='|', quoting=csv.QUOTE_MINIMAL)
[csvwriter.writerow(x) for x in ilist]
Now, you can simply copy list[1] and change the appropriate element to reflect your summed value using:
listTemp = copy.deepcopy(ilist[1])
listTemp[n] = listTemp[n] + listTemp[n-x]
ilist.insert(2,listTemp)
As for how to change the file name, just use:
import os
newFileName = os.path.splitext(oldFileName)[0] + "edited" + os.path.splitext(oldFileName)[1]
Hopefully this will help you out!
I'm trying to import a batch of CSV's into PostgreSQL and constantly run into an issue with missing data:
psycopg2.DataError: missing data for column "column_name" CONTEXT:
COPY table_name, line where ever in the CSV that data wasn't
recorded, and here are data values up to the missing column.
There is no way to get the complete set of data written to the row at times, and I have to deal with the files as is. I am trying to figure a way to remove the row if data wasn't recorded into any column. Here's what I have:
file_list = glob.glob(path)
for f in file_list:
filename = os.path.basename(f) #get the file name
arc_csv = arc_path + filename #path for revised copy of CSV
with open(f, 'r') as inp, open(arc_csv, 'wb') as out:
writer = csv.writer(out)
for line in csv.reader(inp):
if "" not in line: #if the row doesn't have any empty fields
writer.writerow(line)
cursor.execute("COPY table_name FROM %s WITH CSV HEADER DELIMITER ','",(arc_csv,))
You could use pandas to remove rows with missing values:
import glob, os, pandas
file_list = glob.glob(path)
for f in file_list:
filename = os.path.basename(f)
arc_csv = arc_path + filename
data = pandas.read_csv(f, index_col=0)
ind = data.apply(lambda x: not pandas.isnull(x.values).any(), axis=1)
# ^ provides an index of all rows with no missing data
data[ind].to_csv(arc_csv) # writes the revised data to csv
However, this could get slow if you're working with large datasets.
EDIT - added index_col=0 as an argument to pandas.read_csv() to prevent the added index column issue. This uses the first column in the csv as an existing index. Replace 0 with another column's number if you have reason not to use the first column as index.
Unfortunately, you cannot parameterize table or column names. Use string formatting, but make sure to validate/escape the value properly:
cursor.execute("COPY table_name FROM {column_name} WITH CSV HEADER DELIMITER ','".format(column_name=arc_csv))
I have 125 data files containing two columns and 21 rows of data and I'd like to import them into a single .csv file (as 125 pairs of columns and only 21 rows).
This is what my data files look like:
I am fairly new to python but I have come up with the following code:
import glob
Results = glob.glob('./*.data')
fout='c:/Results/res.csv'
fout=open ("res.csv", 'w')
for file in Results:
g = open( file, "r" )
fout.write(g.read())
g.close()
fout.close()
The problem with the above code is that all the data are copied into only two columns with 125*21 rows.
Any help is very much appreciated!
This should work:
import glob
files = [open(f) for f in glob.glob('./*.data')] #Make list of open files
fout = open("res.csv", 'w')
for row in range(21):
for f in files:
fout.write( f.readline().strip() ) # strip removes trailing newline
fout.write(',')
fout.write('\n')
fout.close()
Note that this method will probably fail if you try a large number of files, I believe the default limit in Python is 256.
You may want to try the python CSV module (http://docs.python.org/library/csv.html), which provides very useful methods for reading and writing CSV files. Since you stated that you want only 21 rows with 250 columns of data, I would suggest creating 21 python lists as your rows and then appending data to each row as you loop through your files.
something like:
import csv
rows = []
for i in range(0,21):
row = []
rows.append(row)
#not sure the structure of your input files or how they are delimited, but for each one, as you have it open and iterate through the rows, you would want to append the values in each row to the end of the corresponding list contained within the rows list.
#then, write each row to the new csv:
writer = csv.writer(open('output.csv', 'wb'), delimiter=',')
for row in rows:
writer.writerow(row)
(Sorry, I cannot add comments, yet.)
[Edited later, the following statement is wrong!!!] "The davesnitty's generating the rows loop can be replaced by rows = [[]] * 21." It is wrong because this would create the list of empty lists, but the empty lists would be a single empty list shared by all elements of the outer list.
My +1 to using the standard csv module. But the file should be always closed -- especially when you open that much of them. Also, there is a bug. The row read from the file via the -- even though you only write the result here. The solution is actually missing. Basically, the row read from the file should be appended to the sublist related to the line number. The line number should be obtained via enumerate(reader) where reader is csv.reader(fin, ...).
[added later] Try the following code, fix the paths for your puprose:
import csv
import glob
import os
datapath = './data'
resultpath = './result'
if not os.path.isdir(resultpath):
os.makedirs(resultpath)
# Initialize the empty rows. It does not check how many rows are
# in the file.
rows = []
# Read data from the files to the above matrix.
for fname in glob.glob(os.path.join(datapath, '*.data')):
with open(fname, 'rb') as f:
reader = csv.reader(f)
for n, row in enumerate(reader):
if len(rows) < n+1:
rows.append([]) # add another row
rows[n].extend(row) # append the elements from the file
# Write the data from memory to the result file.
fname = os.path.join(resultpath, 'result.csv')
with open(fname, 'wb') as f:
writer = csv.writer(f)
for row in rows:
writer.writerow(row)