I am processing a CSV file, and before that I get the row count using the code below.
total_rows = sum(1 for row in open(csv_file, "r", encoding="utf-8"))
The code was written with the help given in this link.
However, total_rows doesn't match the actual number of rows in the CSV file. I have found an alternative way to do it, but I would like to know why this is not working correctly.
The CSV file has cells with huge text, and I have to specify the encoding to avoid errors when reading the file.
Any help is appreciated!
Let's assume you have a CSV file in which some cells contain multi-line text.
$ cat example.csv
colA,colB
1,"Hi. This is Line 1.
And this is Line2"
Which, by the look of it, has three lines, and wc -l agrees:
$ wc -l example.csv
3 example.csv
And so does open with sum:
sum(1 for row in open('./example.csv',"r",encoding="utf-8"))
# 3
But now if you read it with a CSV parser such as pandas.read_csv:
import pandas as pd
df = pd.read_csv('./example.csv')
df
   colA                                    colB
0     1  Hi. This is Line 1.\nAnd this is Line2
The alternative way to fetch the correct number of rows is given below:
import csv

with open(csv_file, "r", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter=",")
    data = list(reader)
    row_count = len(data)
Excluding the header, the CSV contains 1 row, which I believe is what you expect.
This is because colB's first cell (a.k.a. the huge text block) is now properly handled, thanks to the quotes wrapping the entire text.
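If the file is large, a variant of the same csv-module approach that avoids materializing every row with list(reader) might look like this (shown against an in-memory sample so the snippet is self-contained; your real code would open the file instead):

```python
import csv
import io

# In-memory stand-in for the example.csv shown above.
sample = 'colA,colB\n1,"Hi. This is Line 1.\nAnd this is Line2"\n'

with io.StringIO(sample) as f:
    # csv.reader yields one logical record per iteration, so embedded
    # newlines inside quoted cells do not inflate the count.
    row_count = sum(1 for _ in csv.reader(f))

print(row_count)  # 2 (header + one data row)
```

The generator expression keeps memory usage constant regardless of file size, which matters if the cells really are huge.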
I think the problem here is that you are not counting rows, but counting newlines (either \r\n on Windows or \n on Linux). The problem arises when you have a cell containing text with newline characters, for example:
1, "my huge text\n with many lines\n"
2, "other text"
Your method on the data above will return 4 when there are actually only 2 rows.
Try using pandas or another library for reading CSV files. Example:
import pandas as pd

df = pd.read_csv(pathToCsv, sep=',', header=None)
number_of_rows = len(df.index)  # or df[0].count()
Note that len(df.index) and df[0].count() are not interchangeable as count excludes NaNs.
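A minimal illustration of that difference, using a hypothetical three-row frame with one NaN:

```python
import numpy as np
import pandas as pd

# Three rows, one of which is NaN in column 0.
df = pd.DataFrame({0: [1.0, np.nan, 3.0]})

print(len(df.index))  # 3 -- counts every row, NaN or not
print(df[0].count())  # 2 -- counts only non-NaN values
```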
I have a CSV file that I am trying to run through, pulling out a string between two strings in every row using Python. I am new to Python. I would then like to return the string found in a new column, along with all other columns and rows. A sample of how the CSV looks is below. I am trying to get everything between /**ZTAG & ZTAG**\
Number Assignment_Group Notes
123456 Team One "2019-06-10 16:24:36 - (Work notes)
05924267 /**ZTAG-DEVICE-HW-APPLICATION-WONT-LAUNCH-STUFF-
SENT-REPAIR-FORCE-REPROCESSED-APPLICATION-ZTAG**\
2019-05-24 16:44:48 - (Work notes)
Attachment:snippet.PNG sys_attachment
sys_id:b08bf432db69ff083bfe3a10ad961961
I have been reading about this for two days. I know I am missing something easy. I have looked at splitting the file in multiple ways.
import csv
import pandas
import re

f = open('test.csv')
csv_f = csv.reader(f)
#match = re.search("/**\ZTAG (.+?) ZTAG**\\", csv_f, flags=re.IGNORECASE)
for row in csv_f:
    #print(re.split('/**ZTAG| ', csv_f))
    #x = csv_f.split('/**ZTAG')
    match = re.search("/**\ZTAG (.+?) ZTAG**\\", csv_f, flags=re.IGNORECASE)
    print(row[0])
f.close()
I would need to have all columns and rows return with new column
containing string. EXAMPLE Below
Number, Assignment_group, Notes, TAG
123456, Team One, All stuff, ZTAG-DEVICE-HW-APPLICATION-WONT-
LAUNCH-STUFF-SENT-REPAIR-FORCE-REPROCESSED-APPLICATION-
I believe this regular expression should work (note the raw string, so the backslashes reach the regex engine intact):
result = re.search(r"/\**ZTAG(.*)ZTAG\**", text)
extracted_text = result.group(1)
this will give you the string
-DEVICE-HW-APPLICATION-WONT-LAUNCH-STUFF- SENT-REPAIR-FORCE-REPROCESSED-APPLICATION-
you can do this for each row in your for loop if necessary
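Putting that together with the csv module, a self-contained sketch might look like the following. The sample row and the position of the Notes column are hypothetical stand-ins for your file, and the regex uses a non-greedy group so it stops at the first closing ZTAG:

```python
import csv
import io
import re

# Hypothetical sample mirroring the question's layout; the real file
# and column positions may differ.
sample = io.StringIO(
    'Number,Assignment_Group,Notes\n'
    '123456,Team One,"2019-06-10 notes /**ZTAG-DEVICE-HW-ZTAG**\\ more"\n'
)

# DOTALL lets the tag span multiple lines inside a quoted cell.
pattern = re.compile(r"/\*\*ZTAG(.*?)ZTAG\*\*", re.DOTALL)

reader = csv.reader(sample)
header = next(reader) + ["TAG"]
rows_out = []
for row in reader:
    m = pattern.search(row[2])           # search the Notes column
    row.append(m.group(1) if m else "")  # new TAG column
    rows_out.append(row)

print(rows_out[0][-1])  # -DEVICE-HW-
```

Note that re.search must be applied to the text of a cell (here row[2]), not to the csv reader object itself, which is why the original code returned nothing useful.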
I am new to both Python and Stack Overflow.
I extract a few columns from a CSV file into an interim CSV file and clean up the data to remove the NaN entries. Once I have extracted them, I end up with the two CSV files below.
Main CSV File:
Sort,Parent 1,Parent 2,Parent 3,Parent 4,Parent 5,Name,Parent 6
1,John,,,Ned,,Dave
2,Sam,Mike,,,,Ken
3,,,Pete,,,Steve
4,,Kerry,,Rachel,,Rog
5,,,Laura,Mitchell,,Kim
Extracted CSV:
Name,ParentNum
Dave,Parent 4
Ken,Parent 2
Steve,Parent 3
Rog,Parent 4
Kim,Parent 4
What I am trying to accomplish is recursing through the main CSV using the name and parent number. But if I write a for loop, it prints empty rows because it looks up every row for the first value. What is the best approach instead of a for loop? I tried DictReader to read the CSV but could not get far. Any help will be appreciated.
CODE:
import xlrd
import csv
import pandas as pd

print('Opening and Reading the msl sheet from the xlsx file')
with xlrd.open_workbook('msl.xlsx') as wb:
    sh = wb.sheet_by_index(2)
    print("The sheet name is :", sh.name)
    with open('msl.csv', 'w', newline="") as f:
        c = csv.writer(f)
        print('Writing to the CSV file')
        for r in range(sh.nrows):
            c.writerow(sh.row_values(r))
df1 = pd.read_csv('msl.csv', index_col='Sort')
with open('dirty-processing.csv', 'w', newline="") as tbl_writer1:
    c2 = csv.writer(tbl_writer1)
    c2.writerow(['Name', 'Parent'])
    for list_item in first_row:
        for item in df1[list_item].unique():
            row_content = [item, list_item]
            c2.writerow(row_content)
Expected Result:
In the above CSV, I would like to grab unique values from each column into a separate file or any other data type. Then also capture the header of the column they are taken from.
Ex:
Negarnaviricota,Phylum
Haploviricotina,Subphylum
...
so on
The next thing I would like to do is get its parent, which is where I am stuck. Also, as you can see, not all columns have data, so I want to get the last non-blank column. Up to this point, everything is accomplished using the above code. The sample output should pair each name with the header of its last non-blank parent column.
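For the "last non-blank column" part, one stdlib-only sketch is shown below. The sample rows and the simplified column layout (parent columns between Sort and Name) are hypothetical; the real file has more parent columns, but the backwards-walk logic is the same:

```python
import csv
import io

# Hypothetical, simplified rows shaped like the question's main CSV.
sample = io.StringIO(
    "Sort,Parent 1,Parent 2,Parent 3,Name\n"
    "1,John,,Ned,Dave\n"
    "2,,Mike,,Ken\n"
)

reader = csv.reader(sample)
header = next(reader)
pairs = []
for row in reader:
    # Everything between Sort and Name is a parent column; keep the
    # index of the last one that is non-blank.
    parents = row[1:-1]
    last_idx = max(i for i, v in enumerate(parents) if v)
    pairs.append((row[-1], header[1 + last_idx]))

print(pairs)  # [('Dave', 'Parent 3'), ('Ken', 'Parent 2')]
```

This assumes every row has at least one non-blank parent; if a row can be entirely blank, guard the max() call accordingly.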
The problem I have can be illustrated by showing a couple of sample rows in my csv (semicolon-separated) file, which look like this:
4;1;"COFFEE; COMPANY";4
3;2;SALVATION ARMY;4
Notice that in one row, a string is in quotation marks and has a semicolon inside it (none of the columns in my input file have quotation marks around them except for the ones containing semicolons).
These rows with quotation marks and semicolons are causing a problem: my code counts the semicolon inside the quoted field as a delimiter, making it seem like the row has an extra field/column.
The desired output would look like this, with no quotation marks around "coffee company" and no semicolon between 'coffee' and 'company':
4;1;COFFEE COMPANY;4
3;2;SALVATION ARMY;4
Actually, this column with "coffee company" is totally useless to me, so the final file could look like this too:
4;1;xxxxxxxxxxx;4
3;2;xxxxxxxxxxx;4
How can I get rid of just the semicolons inside this one particular column, without getting rid of all the other semicolons?
The csv module makes it relatively easy to handle a situation like this:
# Contents of input_file.csv
# 4;1;"COFFEE; COMPANY";4
# 3;2;SALVATION ARMY;4
import csv
input_file = 'input_file.csv' # Contents as shown in your question.
with open(input_file, 'r', newline='') as inp:
    for row in csv.reader(inp, delimiter=';'):
        row[2] = row[2].replace(';', '')  # Remove embedded ';' chars.
        # If you don't care about what's in the column, use the following instead:
        # row[2] = 'xyz'  # Value not needed.
        print(';'.join(row))
Printed output:
4;1;COFFEE COMPANY;4
3;2;SALVATION ARMY;4
Follow-on question: How to write this data to a new csv file?
import csv
input_file = 'input_file.csv' # Contents as shown in your question.
output_file = 'output_file.csv'
with open(input_file, 'r', newline='') as inp, \
        open(output_file, 'w', newline='') as outp:
    writer = csv.writer(outp, delimiter=';')
    for row in csv.reader(inp, delimiter=';'):
        row[2] = row[2].replace(';', '')  # Remove embedded ';' chars.
        writer.writerow(row)
Here's an alternative approach using the pandas library, which spares you from writing explicit loops:
import pandas as pd

# Read csv into dataframe df
df = pd.read_csv('data.csv', sep=';', header=None)
# Remove semicolons in column 2
df[2] = df[2].apply(lambda x: x.replace(';', ''))
This gives the following dataframe df:
0 1 2 3
0 4 1 COFFEE COMPANY 4
1 3 2 SALVATION ARMY 4
Pandas provides several inbuilt functions to help you manipulate data or make statistical conclusions. Having the data in a tabular format can also make working with it more intuitive.
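If you then want to persist the cleaned dataframe back to a semicolon-separated file, a small sketch could look like this (the in-memory input and the name output_file.csv are just stand-ins for your real files):

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the question's semicolon file.
raw = io.StringIO('4;1;"COFFEE; COMPANY";4\n3;2;SALVATION ARMY;4\n')

df = pd.read_csv(raw, sep=';', header=None)
# str.replace works column-wide, without an explicit lambda.
df[2] = df[2].str.replace(';', '', regex=False)

# Write back out with the same delimiter; since the embedded semicolons
# are gone, no field needs quoting anymore.
df.to_csv('output_file.csv', sep=';', header=False, index=False)
```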
I have a number of .csv files that I download into a directory.
Each .csv is supposed to have 3 columns of information. The head of one of these files looks like:
17/07/2014,637580,10.755
18/07/2014,61996,10.8497
21/07/2014,126758,10.8208
22/07/2014,520926,10.8201
23/07/2014,370843,9.2883
The code that I am using to read the .csv into a dataframe (df) is:
df = pd.read_csv(adj_directory+'\\'+filename, error_bad_lines=False,names=['DATE', 'PX', 'RAW'])
Where I name the three columns (DATE, PX and RAW).
This works fine when the file is formatted correctly. However I have noticed that sometimes the .csv has a slightly different format and can look like for example:
09/07/2014,26268315,,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,,
15/07/2014,205019,10.8607
where there is a column value missing and an extra comma appears in the values place. This means that the file fails to load into the dataframe (the df dataframe is empty).
Is there a way to read the data into a dataframe with the extra comma (ignoring the offending row) so the df would look like:
09/07/2014,26268315,NaN
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,NaN
15/07/2014,205019,10.8607
Probably best to fix the file upstream so that missing values aren't written with an extra ,. But if necessary, you can correct the file in Python by replacing ,, with , (line by line). Taking your bad file as test.csv:
import re
import csv
patt = re.compile(r",,")

with open('corrected.csv', 'w') as f2:
    with open('test.csv') as f:
        for line in csv.reader(map(lambda s: patt.sub(',', s), f)):
            f2.write(','.join(str(x) for x in line))
            f2.write('\n')
Output: corrected.csv
09/07/2014,26268315,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,
15/07/2014,205019,10.8607
Then you should be able to read in this file without issue
import pandas as pd
df = pd.read_csv('corrected.csv', names=['DATE', 'PX', 'RAW'])
DATE PX RAW
0 09/07/2014 26268315 NaN
1 10/07/2014 6601181 16.3857
2 11/07/2014 916651 12.5879
3 14/07/2014 213357 NaN
4 15/07/2014 205019 10.8607
Had this problem yesterday.
Have you tried:
pd.read_csv(adj_directory + '\\' + filename,
            error_bad_lines=False, names=['DATE', 'PX', 'RAW'],
            keep_default_na=False,
            na_values=[''])
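Another workaround, not from the answers above and so an assumption about your data: name a fourth dummy column so the rows with the trailing comma parse cleanly, then drop that column. A sketch against an in-memory copy of the bad file:

```python
import io

import pandas as pd

# Hypothetical stand-in for the malformed file: the extra trailing
# comma gives some rows a fourth, empty field.
raw = io.StringIO(
    '09/07/2014,26268315,,\n'
    '10/07/2014,6601181,16.3857\n'
)

# Naming four columns lets every row parse without being skipped;
# short rows simply get NaN in the dummy EXTRA column.
df = pd.read_csv(raw, names=['DATE', 'PX', 'RAW', 'EXTRA'])
df = df.drop(columns='EXTRA')
print(df)
```

Unlike error_bad_lines=False, this keeps the offending rows (with NaN in RAW) instead of discarding them.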
Basically, I have data from a mechanical test in the .raw output format, and I want to access it in Python.
The file needs to be split using the delimiter ";" so that it contains 13 columns.
The idea is then to index and pull out the desired information, which in my case is the "Extension mm" and "Load N" values starting at row 41, as arrays, in order to create a plot.
I have never worked with .raw files and I don't know what to do.
The file can be downloaded here:
https://drive.google.com/file/d/0B0GJeyFBNd4FNEp0elhIWGpWWWM/view?usp=sharing
Hope somebody can help me out there!
You can convert the raw file into a csv file and then use the csv module. Remember to set delimiter=' ', otherwise by default it takes a comma as the delimiter.
import csv

with open('TST0002.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ')
    for row in reader:   # this will read each row, line by line
        print(row[0])    # row[0] gets the first element of that row
Your file looks basically like a .tsv with 40 lines to skip. Could you try this ?
import csv
# export your file.raw to tsv
with open('TST0002.raw') as infile, open('new.tsv', 'w') as outfile:
    lines = infile.readlines()[40:]
    for line in lines:
        outfile.write(line)
Or if you want to make directly some data analysis on your two columns :
import pandas as pd
df = pd.read_csv("TST0002.raw", sep="\t", skiprows=40, usecols=['Extension mm', 'Load N'])
print(df)
output:
Extension mm Load N
0 -118.284 0.1365034
1 -117.779 -0.08668576
2 -117.274 -0.1142517
3 -116.773 -0.1092401
4 -116.271 -0.1144083
5      -115.770   -0.1314806
6 -115.269 -0.03609632
7 -114.768 -0.06334914
....