I am trying to read a CSV file with pandas. The file has 14993 lines after the header.
data = pd.read_csv(filename, usecols=['tweet', 'Sentiment'])
print(len(data))
But it prints 14900, and if I add one line to the end of the file it becomes 14901 rows, so it is not a memory limit or anything like that. I also tried "error_bad_lines", but nothing changed.
Judging by the names of your headers, one can suspect that you have free text. That can easily trip up any CSV parser.
In any case, here's a version that lets you track down inconsistencies in the CSV, or at least gives a hint of what to look for… and then puts it into a dataframe.
import csv
import pandas as pd

with open('file.csv', newline='') as fc:
    creader = csv.reader(fc)  # add settings as needed
    rows = [r for r in creader]

# check the consistency of the rows
print(len(rows))                  # total number of parsed rows
print(set(len(r) for r in rows))  # the distinct field counts that occur
print(tuple((i, r) for i, r in enumerate(rows) if len(r) == bogus_nbr))  # bogus_nbr = a suspicious field count seen above

# find the bogus lines and modify them in memory, or fix the csv and re-read it.
# assuming there are headers...
columns = list(zip(*rows))
df = pd.DataFrame({k: v for k, *v in columns if k in ['tweet', 'Sentiment']})
If the dataset is really big, the code should be rewritten to use only generators (which is not that hard to do…).
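A minimal generator-based sketch of that idea, where the expected field count and the file name are assumptions to adapt:

import csv
import pandas as pd

def consistent_rows(path, expected_len):
    # yield only the rows that have the expected number of fields
    with open(path, newline='') as fc:
        for r in csv.reader(fc):
            if len(r) == expected_len:
                yield r

rows = consistent_rows('file.csv', 2)  # 2 = assumed field count
header = next(rows)                    # assumes the header row itself is well-formed
df = pd.DataFrame(rows, columns=header)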
The only thing not to forget when using a technique like this: if you have numbers, those columns should be recast to a suitable dtype if needed, but that becomes self-evident the moment one attempts to do math on a dataframe filled with strings.
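For instance, a one-liner per numeric column; 'score' here is a made-up column name:

df['score'] = pd.to_numeric(df['score'], errors='coerce')  # unparseable values become NaN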
We are using Pandas to read a CSV into a dataframe:
someDataframe = pandas.read_csv(
    filepath_or_buffer=our_filepath_here,
    error_bad_lines=False,
    warn_bad_lines=True
)
Since we are allowing bad lines to be skipped, we want to be able to track how many have been skipped and put it into a value so that we can metric off of it.
To do this, I was thinking of comparing how many rows we have in the dataframe vs the number of rows in the original file.
I think this does what I want:
someDataframe = pandas.read_csv(
    filepath_or_buffer=our_filepath_here,
    error_bad_lines=False,
    warn_bad_lines=True
)
initialRowCount = sum(1 for line in open(our_filepath_here))
difference = initialRowCount - len(someDataframe.index)
But the hardware running this is very limited, and I would rather not open the file and iterate through it just to get a row count when we're already making one pass via .read_csv. Does anyone know of a better way to get both the successfully processed count and the initial row count for the CSV?
Though I haven't tested this personally, I believe you can count the number of warnings generated by capturing them and checking the length of the returned list of captured warnings. Then add that to the current length of your dataframe:
import warnings
import pandas as pd
with warnings.catch_warnings(record=True) as warning_list:
    someDataframe = pd.read_csv(
        filepath_or_buffer=our_filepath_here,
        error_bad_lines=False,
        warn_bad_lines=True
    )

# may want to check that each warning object is a pandas "bad line" warning
number_of_warned_lines = len(warning_list)
initialRowCount = len(someDataframe) + number_of_warned_lines
https://docs.python.org/3/library/warnings.html#warnings.catch_warnings
Edit: it took a little bit of toying around, but this seems to work with pandas. Instead of depending on the warnings built-in, we'll just temporarily redirect stderr. Then we can count the number of times "Skipping line" occurs in that string, which gives us the count of bad lines that produced this warning message!
import contextlib
import io

import pandas as pd
bad_data = io.StringIO("""
a,b,c,d
1,2,3,4
f,g,h,i,j,
l,m,n,o
p,q,r,s
7,8,9,10,11
""".lstrip())
new_stderr = io.StringIO()
with contextlib.redirect_stderr(new_stderr):
    df = pd.read_csv(bad_data, error_bad_lines=False, warn_bad_lines=True)

n_warned_lines = new_stderr.getvalue().count("Skipping line")
print(n_warned_lines)  # 2
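For reference, pandas 1.3 deprecated error_bad_lines/warn_bad_lines in favour of a single on_bad_lines option, so on newer versions the equivalent call is roughly this (note the exact warning text counted above may differ between versions):

df = pd.read_csv(bad_data, on_bad_lines="warn")  # pandas >= 1.3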
I have some very large CSV files (15+ GB) that contain 4 initial rows of metadata / header info and then the data. The first 3 columns are 3D Cartesian coordinates and are the values I need to change with basic maths operations, e.g. add, subtract, multiply, divide. I need to do this en masse to each of the coordinate columns. The first 3 columns are float-type values.
The rest of the columns in the CSV could be of any type, e.g. string, int, etc....
I currently use a script that reads each row of the CSV, makes the modification, then writes to a new file, and it seems to work fine. But the problem is it takes days on a large file. The machine I'm running on has plenty of memory (120 GB), but my current method doesn't utilise that.
I know I can update a column en masse using a numpy 2D array if I skip the 4 metadata rows, e.g.
arr = np.genfromtxt(input_file_path, delimiter=',', skip_header=4)
arr[:,0]=np.add(arr[:,0],300)
This will update the first column by adding 300 to each value. But the issues I have with trying to use numpy are:
Numpy arrays don't support mixed data types for the rest of the columns that will be imported (I don't know what the other columns will hold, so I can't use structured arrays; or rather, I want this to be a universal tool, so I shouldn't have to know what they will hold).
I can export the numpy array to CSV (provided it's not mixed types), and using regular text functions I can create a separate CSV for the 4 rows of metadata, but then I need to somehow concatenate them, and I don't want to have to read through all the lines of the data CSV just to append it to the bottom of the metadata CSV.
I know that if I can make this work with numpy it will greatly increase the speed, by utilising the machine's large amount of memory and holding the entire CSV in memory while I do the operations. I've never used pandas, but I would also consider it for a solution. I've had a bit of a look into pandas, thinking I may be able to do it with dataframes, but I still need to figure out how to have 4 rows as my column header instead of one, and additionally I haven't seen a way to apply a mass update to a whole column (like I can with numpy) without using a Python loop; I'm not sure whether that would be slow or not if it's already in memory.
The metadata can be empty for rows 2,3,4 but in most cases row 4 will have the data type recorded. There could be up to 200 data columns in addition to the initial 3 coordinate columns.
My current (slow) code looks like this:
import os
import subprocess
import csv
import numpy as np
def move_txt_coords_to(move_by_coords, input_file_path, output_file_path):
    # create new empty output file
    open(output_file_path, 'a').close()

    with open(input_file_path, newline='') as f:
        reader = csv.reader(f)
        for idx, row in enumerate(reader):
            if idx < 4:
                append_row(output_file_path, row)
            else:
                new_x = round(float(row[0]) + move_by_coords['x'], 3)
                new_y = round(float(row[1]) + move_by_coords['y'], 3)
                new_z = round(float(row[2]) + move_by_coords['z'], 3)
                row[0] = new_x
                row[1] = new_y
                row[2] = new_z
                append_row(output_file_path, row)


def append_row(output_file, row):
    # note: opening and closing the output file once per row is a large part
    # of why this is slow
    f = open(output_file, 'a', newline='')
    writer = csv.writer(f, delimiter=',')
    writer.writerow(row)
    f.close()
if __name__ == '__main__':
    move_by_coords = {
        'x': -338802.5,
        'y': -1714752.5,
        'z': 0
    }
    input_file_path = r'D:\incoming_data\large_data_set1.csv'
    output_file_path = r'D:\outgoing_data\large_data_set_relocated.csv'
    move_txt_coords_to(move_by_coords, input_file_path, output_file_path)
Okay, so I've got an almost complete answer, and it was so much easier than trying to use numpy.
import pandas as pd
input_file_path = r'D:\input\large_data.csv'
output_file_path = r'D:\output\large_data_relocated.csv'
move_by_coords = {
    'x': -338802.5,
    'y': -1714752.5,
    'z': 0
}
df = pd.read_csv(input_file_path, header=[0,1,2,3])
df.centroid_x += move_by_coords['x']
df.centroid_y += move_by_coords['y']
df.centroid_z += move_by_coords['z']
df.to_csv(output_file_path,sep=',')
But I have one remaining issue (possibly two). The blank cells in my header are being populated with "Unnamed". I somehow need to substitute a blank string for those in the header rows.
Also, #FBruzzesi has warned me I may need to use a batch size to make it more efficient, which I'll need to check out.
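For what it's worth, a rough sketch of what the batched approach could look like with read_csv's chunksize, reusing the paths and offsets above (untested; assumes the first three columns are the coordinates):

import pandas as pd

# copy the 4 metadata/header rows over verbatim
with open(input_file_path) as src, open(output_file_path, 'w') as dst:
    for _ in range(4):
        dst.write(src.readline())

# then stream the data in chunks, shifting the first three columns
for chunk in pd.read_csv(input_file_path, skiprows=4, header=None, chunksize=1_000_000):
    chunk.iloc[:, 0] += move_by_coords['x']
    chunk.iloc[:, 1] += move_by_coords['y']
    chunk.iloc[:, 2] += move_by_coords['z']
    chunk.to_csv(output_file_path, mode='a', header=False, index=False)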
--------------------- Update ---------------------
Okay, I resolved the multiline header issue. I just use the regular csv reader module to read the first 4 rows into a list of rows, then I transpose this into a list of columns, converting each column list into a tuple at the same time. Once I have a list of column-header tuples (where each tuple consists of the rows within that column header), I can use the list to name the header. I therefore skip the header rows when reading the CSV into the dataframe, and then update each column by its index. I also drop the index column on export back to CSV once done.
It seems to work very well.
import csv
import itertools
import pandas as pd
def make_first_4rows_list_of_tuples(input_csv_file_path):
    with open(input_csv_file_path, newline='') as f:  # with-block closes the file, which the original forgot
        reader = csv.reader(f)
        header_rows = []
        for row in itertools.islice(reader, 0, 4):
            header_rows.append(row)
    header_col_tuples = list(map(tuple, zip(*header_rows)))
    print("Header columns: \n", header_col_tuples)
    return header_col_tuples
if __name__ == '__main__':
    move_by_coords = {
        'x': 1695381.5,
        'y': 5376792.5,
        'z': 100
    }
    input_file_path = r'D:\temp\mydata.csv'
    output_file_path = r'D:\temp\my_updated_data.csv'

    column_headers = make_first_4rows_list_of_tuples(input_file_path)
    df = pd.read_csv(input_file_path, skiprows=4, names=column_headers)
    df.iloc[:, 0] += move_by_coords['x']
    df.iloc[:, 1] += move_by_coords['y']
    df.iloc[:, 2] += move_by_coords['z']
    df.to_csv(output_file_path, sep=',', index=False)
I know this is a very easy task, but I am acting pretty dumb right now and don't get it solved. I need to copy the first column of a .csv file, including the header, into a newly created file. My code:
station = 'SD_01'
import csv
import pandas as pd
df = pd.read_csv(str(station) + "_ED.csv", delimiter=';')
list1 = []
matrix1 = df[df.columns[0]].as_matrix()
list1 = matrix1.tolist()
with open('{0}_RRS.csv'.format(station), "r+") as f:
    writer = csv.writer(f)
    writer.writerows(map(lambda x: [x], list1))
As a result, my file has an empty line between the values, has no header (I could continue without the header, though), and something at the bottom which I cannot identify:
>350
>
>351
>
>352
>
>...
>
>949
>
>950
>
>Ž‘’“”•–—˜™š›œžŸ ¡¢
Just a short impression of the 1200+ lines
I am pretty sure that this is a very clunky way to do this; easier ways are always welcome.
How do I get rid of all the empty lines and this crazy stuff at the end?
When you get a column from a dataframe, it's returned as a Series, and the Series has a built-in to_csv method you can use. So you don't need to do any matrix casting or anything like that.
import pandas as pd
df = pd.read_csv('name.csv', delimiter=';')
first_column = df[df.columns[0]]
first_column.to_csv('new_file.csv', header=True, index=False)  # header=True keeps the column name; no index column
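For what it's worth, the artifacts in the original attempt have identifiable causes: the blank lines come from handing csv.writer a file opened without newline='' (on Windows every row then gets an extra carriage return), and the garbage at the end comes from opening the file in "r+" mode, which overwrites the old contents from the start but leaves any trailing bytes of a longer previous file in place. A hedged sketch of the plain-csv route, if you would rather not go through pandas for the write:

import csv

with open('{0}_RRS.csv'.format(station), 'w', newline='') as f:  # 'w' truncates; newline='' prevents blank lines
    writer = csv.writer(f)
    writer.writerow([df.columns[0]])                  # the header
    writer.writerows([v] for v in df[df.columns[0]])  # one value per row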
I'm new to the Python language and I'm facing a small challenge which I haven't been able to figure out so far.
I receive a CSV file with around 30-40 columns and 5-50 rows, with various details in each cell. The 1st row of the CSV has the title for each column, and from the 2nd row on I have item values.
What I want to do is create a Python script which will read the CSV file and do the following every time:
Add a row after the actual 1st item row (literally after the 2nd row, because the 1st row is titles), and have that new 3rd row contain the same information as the one above it, with one difference only: in the column "item_subtotal" I want to add the value from the column "discount_total".
All the rows below should remain as they are, and the modified CSV should be saved as a new file with the word "edited" added to the file name.
I could really use some help, because so far I've only managed to open the CSV file with the Python script I'm developing; I'm not yet able to copy the contents of the row above into the newly created row and replace that specific value.
Looking forward to any help.
Thank you.
Here I'm attaching the CSV with some values changed for privacy reasons:
order_id,order_number,date,status,shipping_total,shipping_tax_total,fee_total,fee_tax_total,tax_total,discount_total,order_total,refunded_total,order_currency,payment_method,shipping_method,customer_id,billing_first_name,billing_last_name,billing_company,billing_email,billing_phone,billing_address_1,billing_address_2,billing_postcode,billing_city,billing_state,billing_country,shipping_first_name,shipping_last_name,shipping_address_1,shipping_address_2,shipping_postcode,shipping_city,shipping_state,shipping_country,shipping_company,customer_note,item_id,item_product_id,item_name,item_sku,item_quantity,item_subtotal,item_subtotal_tax,item_total,item_total_tax,item_refunded,item_refunded_qty,item_meta,shipping_items,fee_items,tax_items,coupon_items,order_notes,download_permissions_granted,admin_custom_order_field:customer_type_5
15001_TEST_2,,"2017-10-09 18:53:12",processing,0,0.00,0.00,0.00,5.36,7.06,33.60,0.00,EUR,PayoneCw_PayPal,"0,00",0,name,surname,,name.surname#gmail.com,0123456789,"address 1",,41541_TEST,location,,DE,name,surname,address,01245212,14521,location,,DE,,,1328,302,"product title",103,1,35.29,6.71,28.24,5.36,0.00,0,,"id:1329|method_id:free_shipping:3|method_title:0,00|total:0.00",,id:1330|rate_id:1|code:DE-MWST-1|title:MwSt|total:5.36|compound:,"id:1331|code:#getgreengent|amount:7.06|description:Launchcoupon for friends","text string",1,
You can also use pandas to manipulate the data from the csv like this:
import pandas
import copy
Read the csv file into a pandas dataframe:
df = pandas.read_csv(filename)
Make a deepcopy of the first row of data and add the discount total to the item subtotal:
new_row = copy.deepcopy(df.loc[0])  # df.loc[0] is the first data row; the header is not part of the data
new_row['item_subtotal'] += new_row['discount_total']
Concatenate the first data row with the new row, and then everything after that:
df = pandas.concat([df.loc[:0], new_row.to_frame().T, df.loc[1:]], ignore_index=True)  # to_frame().T turns the Series back into a one-row frame
Change the filename and write out the new csv file:
filename = filename[:-len('.csv')] + '_edited.csv'  # str.strip('.csv') would strip characters, not the suffix
df.to_csv(filename, index=False)
I hope this helps! Pandas is great for cleanly handling massive amounts of data, but may be overkill for what you are trying to do. Then again, maybe not. It would help to see an example data file.
The first step is to turn that .csv into something that is a little easier to work with. Fortunately, python has the 'csv' module which makes it easy to turn your .csv file into a much nicer list of lists. The below will give you a way to both turn your .csv into a list of lists and turn the modified data back into a .csv file.
import csv
import copy
def csv2list(ifile):
    """
    ifile = the path of the csv to be converted into a list of lists
    """
    olist = []
    with open(ifile, newline='') as f:  # text mode with newline='' (Python 3's equivalent of the old 'rb')
        c = csv.reader(f, dialect='excel')
        for line in c:
            olist.append(line)  # update the outer list
    return olist
#------------------------------------------------------------------------------
def list2csv(ilist, ofile):
    """
    ilist = the list of lists to be converted
    ofile = the output path for your csv file
    """
    with open(ofile, 'w', newline='') as csvfile:  # 'w' + newline='' replaces Python 2's 'wb'
        csvwriter = csv.writer(csvfile, delimiter=',',
                               quotechar='|', quoting=csv.QUOTE_MINIMAL)
        for x in ilist:
            csvwriter.writerow(x)
Now, you can simply copy ilist[1] and change the appropriate element to reflect your summed value using:
listTemp = copy.deepcopy(ilist[1])         # ilist[1] is the first data row; ilist[0] holds the headers
listTemp[n] = listTemp[n] + listTemp[n-x]  # n and n-x stand in for the item_subtotal and discount_total indices
ilist.insert(2, listTemp)
As for how to change the file name, just use:
import os
newFileName = os.path.splitext(oldFileName)[0] + "_edited" + os.path.splitext(oldFileName)[1]
Hopefully this will help you out!
I would like to know if there is an option for the pandas.read_csv function which allows me to load only a certain list of rows from the original CSV file.
The CSV file is really big, and I can't load the whole file due to a lack of memory.
Is there an option like:
df = pandas.read_csv(file, read_only=list_to_read) ?
with list_to_read = [0, 2, 10] for example (this would read only row 0, row 2 and row 10)
Many thanks in advance
If you go over the docs for read_csv you will find the nrows kwarg:
nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files
Note however that this will read the first n rows of the file, not arbitrary lines (i.e. you can't provide it [0, 2, 10] and expect it to read the first, third and eleventh rows).
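That said, skiprows accepts a callable, which can be turned into a row whitelist. A minimal sketch, assuming the header is the file's first line and that the wanted data rows sit on file lines 1, 3 and 11:

import pandas as pd

keep = {1, 3, 11}  # file line numbers to keep; line 0 is the header
df = pd.read_csv('file.csv', skiprows=lambda i: i != 0 and i not in keep)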
You may want to iteratively update the dataframe as you read through the file. This is not a fast process, but it will get only the rows of interest into a dataframe without pulling the entire file into memory.
import pandas as pd

col_list = ['columnA', 'columnB', ... ]  # fill in your data columns
row_list = [0, 3, 10, ... ]

df = pd.DataFrame(columns=col_list)
row_number = 0
with open('path/to/file') as fp:
    for i, line in enumerate(fp):  # iterate lazily, line by line (xreadlines was Python 2 only)
        if i in row_list:
            data_line = [float(v) for v in line.strip().split(',')]  # assumes all columns are floats
            df.loc[row_number] = data_line
            row_number += 1