Read in only rows in between certain strings Python - python

So I have a text file that I am trying to read with csv in python, however I only want the rows in between two rows that start with certain strings. I have no problems with just reading the data, I have:
import csv
with open('path to file','r') as inf:
    reader = csv.reader(inf, delimiter=" ")
and to get all the data I can just loop through and append to a list:
raw_data=[]
for row in reader:
    raw_data.append(row)
I know I can get the rows I want by doing something like:
for row in raw_data:
    if row[0] == 'string1':
        begin_idx = raw_data.index(row)
    elif row[0] == 'string2':
        end_idx = raw_data.index(row)
data=[]
for idx in range(begin_idx+1,end_idx):
    data.append(raw_data[idx])
However, I was hoping to do this all at once when I first loop through the text file, so if anyone has any ideas on how this could be done it would be appreciated.
Note, the reason I am not just looking for index of the rows I want is because they are just a list of integers that will change each time I run this. The pdf to text conversion I run isn't extremely clean, so the row titles don't line up with the actual data for the row.

Iterator objects are nice in that a for loop simply calls next() on the object (here, reader) each time around, so a second loop over the same object picks up where the first one left off.
This lets you go through the file in one linear pass by starting a separate inner loop when you hit the starting string. Try this:
import csv
with open('path to file','r') as inf:
    reader = csv.reader(inf, delimiter=" ")
    data=[]
    for row in reader:
        if row[0] == 'string1':
            for row in reader:
                if row[0]=='string2':
                    break
                data.append(row)
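To see why the nested loop works, here is a small self-contained sketch that runs the same pattern on an in-memory sample. The sample text is made up for illustration, and the `if row` guard is an addition that assumes the file may contain blank lines (which `csv.reader` returns as empty lists).

```python
import csv
import io

# Hypothetical sample standing in for the real file.
sample = "junk 0\nstring1 header\n1 2 3\n4 5 6\nstring2 footer\njunk 9\n"

reader = csv.reader(io.StringIO(sample), delimiter=" ")
data = []
for row in reader:
    if row and row[0] == "string1":
        # The inner loop pulls from the same iterator, so it resumes
        # on the row right after the start marker.
        for row in reader:
            if row and row[0] == "string2":
                break
            data.append(row)
        break  # stop after the first block; remove to collect later blocks too

print(data)
```

The outer `break` is optional. Without it the outer loop keeps reading and would collect any later block that starts with the same marker.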

You can introduce a state variable into your for loop:
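data = []
copying = False
for row in reader:
    # check the end marker first so the 'string2' row itself is not copied
    if row[0] == 'string2':
        copying = False
    if copying:
        data.append(row)
    # check the start marker last so the 'string1' row itself is not copied
    if row[0] == 'string1':
        copying = True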
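The state-variable idea can also be wrapped in a generator so it stops reading as soon as the end marker is reached. This is only a sketch: the function name `rows_between` is hypothetical, and it assumes only the first block between the markers is wanted.

```python
def rows_between(rows, start, end):
    """Yield the rows strictly between the first start row and the next end row."""
    copying = False
    for row in rows:
        first = row[0] if row else None  # blank lines come through as []
        if copying and first == end:
            return  # end marker reached: stop without yielding it
        if copying:
            yield row
        elif first == start:
            copying = True  # start marker seen: copy from the next row on


rows = [["junk"], ["string1", "x"], ["1", "2"], [], ["3", "4"], ["string2"], ["junk"]]
data = list(rows_between(rows, "string1", "string2"))
print(data)
```

With a real file, `rows` would be the `csv.reader` object itself, so nothing after the end marker is read.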

Related

Skip Row In CSV If Contains Strings From List

I have a list of approximately 500 strings that I want to check against a CSV file containing 25,000 rows. What I currently have seems to be getting stuck looping. I basically want to skip the row if it contains any of the strings in my string list and then extract other data.
stringList = [] #strings look like "AAA", "AAB", "AAC", etc.
with open('BadStrings.csv', 'r') as csvfile:
    filereader = csv.reader(csvfile, delimiter=',')
    for row in filereader:
        stringToExclude = row[0]
        stringList.append(stringToExclude)
with open('OtherData.csv', 'r') as csvfile:
    filereader = csv.reader(csvfile, delimiter=',')
    next(filereader, None) #Skip header row
    for row in filereader:
        for s in stringList:
            if s not in row:
                data1 = row[1]
Edit: Not an infinite loop, but looping is taking too long.
Following Niels's suggestion, I would change the second loop to iterate over the row itself and check whether each entry of the row is in the "bad" list:
for row in filereader:
    for s in row:
        if s not in stringList:
            data1 = row[0]
I also don't know what you want to do with data1, but as written it is rebound every time an item is not in stringList.
You could make data1 a list and add items to it with data1.append(item).
You could try something like this.
stringList = [] #strings look like "AAA", "AAB", "AAC", etc.
with open('BadStrings.csv', 'r') as csvfile:
    filereader = csv.reader(csvfile, delimiter=',')
    for row in filereader:
        stringToExclude = row[0]
        stringList.append(stringToExclude)
data1 = [] # Right now you are overwriting your data1 every time. I don't know what you want to do with it, but you could for example add all row[1] to a list data1
with open('OtherData.csv', 'r') as csvfile:
    filereader = csv.reader(csvfile, delimiter=',')
    next(filereader, None) #Skip header row
    for row in filereader:
        found_s = False
        for s in stringList:
            if s in row:
                found_s = True
                break
        if not found_s:
            data1.append(row[1]) # Add row[1] to the list if no element of stringList is found in row.
Still probably not a huge performance improvement, but at least the for loop for s in stringList: will now stop after s is found.
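If the inner loop is still too slow, one option is to keep the bad strings in a set, since a set lookup does not scan all 500 entries. The sketch below assumes, as `s in row` does, that a bad string has to equal a whole cell. The function name `keep_rows` and the sample rows are made up for illustration.

```python
def keep_rows(rows, bad_strings):
    """Return the rows in which no cell equals one of the bad strings."""
    bad = set(bad_strings)
    # isdisjoint(row) is True when the row shares no cell with the bad set
    return [row for row in rows if bad.isdisjoint(row)]


rows = [["1", "AAA", "x"], ["2", "ok", "y"], ["3", "AAC", "z"]]
data1 = [row[1] for row in keep_rows(rows, ["AAA", "AAB", "AAC"])]
print(data1)
```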

Replacing and deleting columns from a csv using python

Here is the code I am writing:
import csv
import openpyxl
def read_file(fn):
    rows = []
    with open(fn) as f:
        reader = csv.reader(f, quotechar='"',delimiter=",")
        for row in reader:
            if row:
                rows.append(row)
    return rows
replace = {x[0]:x[1:] for x in read_file("replace.csv")}
delete = set( (row[0] for row in read_file("delete.csv")) )
result = []
input_file="input.csv"
with open(input_file) as f:
    reader = csv.reader(f, quotechar='"')
    for row in reader:
        if row:
            if row[7] in delete:
                continue
            elif row[7] in replace:
                result.append(replace[row[7]])
            else:
                result.append(row)
with open ("done.csv", "w+", newline="") as f:
    w = csv.writer(f,quotechar='"', delimiter= ",")
    w.writerows(result)
here are my files:
input.csv:
c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13
"-","-","-","-","-","-","-","aaaaa","-","-","bbbbb","-",","
"-","-","-","-","-","-","-","ccccc","-","-","ddddd","-",","
"-","-","-","-","-","-","-","eeeee","-","-","fffff","-",","
this is a 13 column csv. I am interested only in the 8th and the 11th fields.
this is my replace.csv:
"aaaaa","11111","22222"
delete.csv:
ccccc
So what I am doing is comparing the first column of replace.csv (line by line) with the 8th column of input.csv. If they match, the 8th column of input.csv is replaced with the second column of replace.csv, and the 11th column of input.csv with the third column of replace.csv.
For delete.csv, the two files are compared line by line, and if a match is found the entire row is deleted.
If a line matches nothing in either replace.csv or delete.csv, it is printed as it is.
so my desired output is:
c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13
"-","-","-","-","-","-","-",11111,"-","-",22222,"-",","
"-","-","-","-","-","-","-","eeeee","-","-","fffff","-",","
but when I run this code it gives me an output like this:
c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13
11111,22222
where am I going wrong?
I am adapting the program from a question I posted earlier, because the input file has since changed.
https://stackoverflow.com/a/54388144/9279313
@anuj
I think SafeDev's solution is optimal but if you don't want to go with pandas, just make little changes in your code.
for row in reader:
    if row:
        if row[7] in delete:
            continue
        elif row[7] in replace:
            key = row[7]
            row[7] = replace[key][0]
            row[10]= replace[key][1]
            result.append(row)
        else:
            result.append(row)
Hope this solves your issue.
It's actually quite simple. Instead of writing it from scratch, just use the pandas library. From there it's easier to handle any dataset. This is how you would do it:
EDIT:
import pandas as pd
input_csv = pd.read_csv('input.csv')
replace_csv = pd.read_csv('replace.csv', header=None)
# delete.csv has no header row either, so read it with header=None
delete_csv = pd.read_csv('delete.csv', header=None)
# first column of each file holds the values to match against c8
r_lst = replace_csv.iloc[:, 0].tolist()
d_lst = delete_csv.iloc[:, 0].tolist()
input2_csv = input_csv.copy()
for i, row in input_csv.iterrows():
    if row['c8'] in r_lst:
        input2_csv.loc[i, 'c8'] = replace_csv.iloc[r_lst.index(row['c8']), 1]
        input2_csv.loc[i, 'c11'] = replace_csv.iloc[r_lst.index(row['c8']), 2]
    if row['c8'] in d_lst:
        input2_csv = input2_csv[input2_csv.c8 != row['c8']]
input2_csv.to_csv('output.csv', index=False)
This process can be made even more dynamic by turning it into a function that has parameters of column names and replacing 'c8' and 'c11' with those two parameters.
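As a rough illustration of that kind of parametrization, here is a sketch that uses plain lists (as produced by the csv module) rather than pandas, with column positions instead of column names. The function name `apply_rules` and its parameters are hypothetical, not part of either answer.

```python
def apply_rules(rows, replace, delete, key_col=7, target_cols=(7, 10)):
    """Drop rows whose key is in `delete`; overwrite target columns for keys in `replace`."""
    result = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        key = row[key_col]
        if key in delete:
            continue
        if key in replace:
            row = list(row)  # copy so the caller's row is left untouched
            for col, value in zip(target_cols, replace[key]):
                row[col] = value
        result.append(row)
    return result


rows = [
    ["c%d" % n for n in range(1, 14)],
    ["-"] * 7 + ["aaaaa", "-", "-", "bbbbb", "-", ","],
    ["-"] * 7 + ["ccccc", "-", "-", "ddddd", "-", ","],
    ["-"] * 7 + ["eeeee", "-", "-", "fffff", "-", ","],
]
out = apply_rules(rows, {"aaaaa": ["11111", "22222"]}, {"ccccc"})
for row in out:
    print(row)
```

The header row passes through unchanged because "c8" appears in neither the replace mapping nor the delete set.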

How to print specific rows in a CSV files which have a specific value in a specific column?

I am new to Python. I have a CSV file and want to print specific rows from it; I'd appreciate any guidance. For example, in the table below I want to print a row if its record number is 2:
This image shows an example of my case
I have below code as starter which prints out the headers:
with open(filename, "r") as f:
    reader = csv.reader(f, delimiter="\t")
    first = next(reader)
    print(first[0].split(','))
    for row in filename:
        print()
Thanks!
Your example code seems somewhat confused. I presume the file is actually comma separated, not tab delimited; otherwise you wouldn't need the first[0].split(',').
Assuming that's the case, maybe something like this would work:
with open(filename, "r") as f:
    reader = csv.reader(f)
    # skip header row
    header = next(reader)
    for row in reader:
        if int(row[0]) == 2:
            print(row)
if you're after a specific row number, you could use enumerate to count rows and print when you get to the correct one.
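A minimal sketch of the enumerate variant, using a made-up in-memory sample in place of the real file; the names `sample`, `wanted` and `found` are only for illustration.

```python
import csv
import io

# Hypothetical sample standing in for the real CSV file.
sample = "record,name\n1,alpha\n2,beta\n3,gamma\n"

reader = csv.reader(io.StringIO(sample))
header = next(reader)  # skip the header row
wanted = 2             # position of the data row to print, counting from 1
found = None
for line_no, row in enumerate(reader, start=1):
    if line_no == wanted:
        found = row
        print(row)
        break
```

Here the count is the position of the row in the file, which only matches the record number if the records are numbered consecutively from 1.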
In your for loop, check whether the record number, which is the 0th column, equals 2. The csv module returns every field as a string, so compare against '2':
for row in reader:
    if row[0] == '2':
        print(row)

Iterating through particular rows in a csvFile in Python

I have a programming assignment that involves CSV files. So far, my only issue is obtaining values from specific rows, namely the rows that the user wants to look up.
When I got frustrated I just appended each column to a separate list, which is very slow (when the list is printed for testing) because each column has hundreds of values.
Question:
The desired rows are the rows where row[0] == user_input. How can I obtain these particular rows only and ignore the others?
This should give you an idea:
import csv
with open('file.csv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    # build the list inside the with block, while the file is still open
    user_rows = [row for row in reader if row[0] == user_input]
Python has the csv module:
import csv
rows = []
with open('a.csv', 'r') as f:
    for row in csv.reader(f, delimiter=','):
        if row[0] == user_input:
            rows.append(row)
def filter_csv_by_prefix(csv_path, prefix):
    with open(csv_path, 'r') as f:
        return tuple(filter(lambda line: line.split(',')[0] == prefix, f.readlines()))
for line in filter_csv_by_prefix('your_csv_file', 'your_prefix'):
    print(line)
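Splitting each line on commas by hand breaks on quoted fields that contain commas. A possible alternative, sketched here with a hypothetical function name `iter_matching_rows`, lets the csv module do the parsing and yields matching rows one at a time instead of building every column as a list.

```python
import csv


def iter_matching_rows(csv_path, wanted, column=0):
    """Yield the parsed rows whose value in `column` equals `wanted`."""
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            # the length check skips blank or short lines
            if len(row) > column and row[column] == wanted:
                yield row
```

It would be used as `for row in iter_matching_rows('file.csv', user_input): print(row)`; the file stays open only while the rows are being consumed.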

Search CSV for string and assign column of cell to a list

I am trying to create a program which will scan a CSV for IMG SRC tags and then test them for their response. I'm stuck with this portion of the code which ideally searches the entire CSV document for a 'SRC' cell (to find the IMG SRC tags), and then assigns that column as the one to run the tests on. Here is my attempt:
src_check = ('SRC')
imp_check = ('Impression')
with open("ORIGINAL.csv", 'r') as csvfile:
    reader = csv.reader(csvfile)
    for i, row in enumerate(reader):
        for j, column in enumerate(row):
            if src_check in column[:]:
                list = [column[j] for column in csv.reader(csvfile)]
My confusion comes from the fact that when I manually enter the column number into my program, it runs as it should: it tests each cell of the column/list and neatly writes the results next to each tag tested.
To reiterate my problem, I would like this snippet to find the first IMG SRC cell of the entire CSV. Then it would note the number of that column, and I can assign the entire column to a list for the tests to be run. For example, the process after would be:
Column 16 has been identified as carrying the IMG SRC tags.
Assign the contents of the column to a list.
Run request tests on list.
Right now the test result column does not line up with the tags that it tests. Does anyone have a better method in finding a cell based on a string and then assigning the column as a list, in-line with the cells it's testing?
You need to find the column index first and then rewind the file to the beginning before you read the column:
src_check = 'SRC'
with open("ORIGINAL.csv", 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        if src_check in row:
            col = row.index(src_check)
            break
    else:
        raise ValueError('SRC not found')
    # go back to the beginning of the file
    csvfile.seek(0)
    values = [row[col] for row in reader]
I suspect your problem is that both csv.reader objects share the same file object, so the second one starts reading wherever the first one left off and the offset is all messed up.
Try to do one thing after another (and/or reset the offset via csvfile.seek(0)) and it should work.
src_check = 'SRC'
with open("ORIGINAL.csv", 'r') as csvfile:
    reader = csv.reader(csvfile)
    col_index = -1
    for row in reader:
        for j, column in enumerate(row):
            if src_check in column:
                col_index = j
                break
        if col_index != -1:
            break
    else:
        raise ValueError("Column not found")
    # rewind so the whole column is read from the first row
    csvfile.seek(0)
    col_vals = [row[col_index] for row in reader]
print(col_vals)
Edit: Also you shouldn't use the name of builtin (like "list") as a variable name.
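If the file fits in memory, another way to keep the results aligned with the tags is to read every row into a list once, which avoids the rewind altogether. This is only a sketch: the helper name `column_containing` and the sample data are invented for illustration.

```python
import csv
import io


def column_containing(rows, needle):
    """Return the index of the first column that has a cell containing `needle`."""
    for row in rows:
        for j, cell in enumerate(row):
            if needle in cell:
                return j
    raise ValueError(f"{needle!r} not found")


# Hypothetical sample standing in for ORIGINAL.csv.
sample = "id,tag\n1,<IMG SRC=a.png>\n2,<IMG SRC=b.png>\n"
rows = list(csv.reader(io.StringIO(sample)))

col = column_containing(rows, "SRC")
# One value per row, in file order, so results line up with their rows.
values = [row[col] for row in rows if len(row) > col]
print(col, values)
```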
