I am looking to remove rows from a csv file if they contain specific strings anywhere in the row.
I'd like to be able to create a new output file versus overwriting the original.
I need to remove any rows that contain "py-board" or "coffee"
Example Input:
173.20.1.1,2-base
174.28.2.2,2-game
174.27.3.109,xyz-b13-coffee-2
174.28.32.8,2-play
175.31.4.4,xyz-102-o1-py-board
176.32.3.129,xyz-b2-coffee-1
177.18.2.8,six-jump-walk
Expected Output:
173.20.1.1,2-base
174.28.2.2,2-game
174.28.32.8,2-play
177.18.2.8,six-jump-walk
I tried this
import csv
with open('input_csv_file.csv', 'rb') as inp, open('purged_csv_file', 'wb') as out:
    writer = csv.writer(out)
    for row in csv.reader(inp):
        if row[1] != "py-board" or if row[1] != "coffee":
            writer.writerow(row)
and I tried this
import csv
with open('input_csv_file.csv', 'rb') as inp, open('purged_csv_file', 'wb') as out:
    writer = csv.writer(out)
    for row in csv.reader(inp):
        if row[1] != "py-board":
            if row[1] != "coffee":
                writer.writerow(row)
and this
if row[1][-8:] != "py-board":
    if row[1][-8:] != "coffee-1":
        if row[1][-8:] != "coffee-2":
but got this error
File "C:\testing\syslogyamlclean.py", line 6, in <module>
for row in csv.reader(inp):
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
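The traceback comes from the file modes: in Python 3 the csv module needs text-mode files, not 'rb'/'wb'. Also, equality tests like row[1] != "py-board" never match here, because the unwanted words are embedded inside longer values, so a substring test is needed. A minimal corrected sketch of the csv-based attempt (it recreates the sample input from the question so it is self-contained):

```python
import csv

# Recreate the sample input from the question.
with open('input_csv_file.csv', 'w', newline='') as f:
    f.write('173.20.1.1,2-base\n'
            '174.28.2.2,2-game\n'
            '174.27.3.109,xyz-b13-coffee-2\n'
            '174.28.32.8,2-play\n'
            '175.31.4.4,xyz-102-o1-py-board\n'
            '176.32.3.129,xyz-b2-coffee-1\n'
            '177.18.2.8,six-jump-walk\n')

# Python 3: open csv files in text mode with newline=''.
with open('input_csv_file.csv', 'r', newline='') as inp, \
     open('purged_csv_file.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for row in csv.reader(inp):
        # Substring test, not equality: the unwanted words are
        # embedded inside longer field values.
        if 'py-board' not in row[1] and 'coffee' not in row[1]:
            writer.writerow(row)
```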
I would actually not use the csv package for this goal. This can be achieved easily using standard file reading and writing.
Try this code (I have written some comments to make it self-explanatory):
# We open the source file and get its lines
with open('input_csv_file.csv', 'r') as inp:
    lines = inp.readlines()

# We open the target file in write-mode
with open('purged_csv_file.csv', 'w') as out:
    # We go line by line, writing to the target file
    # if the original line does not include the
    # strings 'py-board' or 'coffee'
    for line in lines:
        if 'py-board' not in line and 'coffee' not in line:
            out.write(line)
# pandas helps to read and manipulate the .csv file
import pandas as pd
import numpy as np

# read the .csv file
df = pd.read_csv('input_csv_file.csv', sep=',', header=None)
df
0 1
0 173.20.1.1 2-base
1 174.28.2.2 2-game
2 174.27.3.109 xyz-b13-coffee-2
3 174.28.32.8 2-play
4 175.31.4.4 xyz-102-o1-py-board
5 176.32.3.129 xyz-b2-coffee-1
6 177.18.2.8 six-jump-walk
# filter rows
result = df[np.logical_not(df[1].str.contains('py-board') | df[1].str.contains('coffee'))]
print(result)
0 1
0 173.20.1.1 2-base
1 174.28.2.2 2-game
3 174.28.32.8 2-play
6 177.18.2.8 six-jump-walk
# save to result.csv file
result.to_csv('result.csv', index=False, header=False)
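As a side note, the two str.contains calls can be collapsed into one, since str.contains accepts a regular expression; a small sketch on an inline two-column frame (the frame here is a stand-in for the one read above):

```python
import pandas as pd

df = pd.DataFrame({0: ['173.20.1.1', '174.27.3.109', '175.31.4.4'],
                   1: ['2-base', 'xyz-b13-coffee-2', 'xyz-102-o1-py-board']})

# str.contains takes a regex, so both words fit in one pattern;
# ~ negates the boolean mask, keeping rows that match neither word.
result = df[~df[1].str.contains('py-board|coffee')]
```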
Related
I have a text file that contains a sentence in each line. Some lines are also empty.
sentence 1
sentence 2
empty line
I want to write the content of this file in a csv file in a way that the csv file has only one column and in each row the corresponding sentence is written. This is what I have tried:
import csv

f = open('data 2.csv', 'w')
with f:
    writer = csv.writer(f)
    for row in open('data.txt', 'r'):
        writer.writerow(row)
import pandas as pd
df = pd.read_csv('data 2.csv')
Supposing that I have three sentences in my text file, I want a csv file to have one column with 3 rows. However, when I run the code above, I will get the output below:
[1 rows x 55 columns]
It seems that each character in the sentences is written in one cell and all sentences are written in one row. How should I fix this problem?
So you want to load a text file into a single column of a dataframe, one line per dataframe row. It can be done directly:
import pandas as pd

with open('data.txt') as file:
    df = pd.DataFrame((line.strip() for line in file), columns=['text'])
You can even filter empty lines at read time with filter:
with open('data.txt') as file:
    df = pd.DataFrame(filter(lambda x: len(x) > 0, (line.strip() for line in file)),
                      columns=['text'])
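Alternatively, pandas can load such a text file directly, one line per row; a sketch, assuming the sentences contain no commas (skip_blank_lines, which is on by default, drops the empty lines):

```python
import pandas as pd

# Recreate a small data.txt for illustration.
with open('data.txt', 'w') as f:
    f.write('sentence 1\n\nsentence 2\n')

# header=None treats every line as data; with no commas in the
# sentences, each line becomes a single cell in the 'text' column.
df = pd.read_csv('data.txt', header=None, names=['text'])
```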
In your code, the problem is not the reading loop but writerow(): csv.writer.writerow() expects a sequence of fields, so passing a string makes every character its own cell. Wrap each line in a one-element list to get a single column:
import csv

f = open('data 2.csv', 'w')
with f:
    writer = csv.writer(f)
    text_file = open('data.txt', 'r')
    for row in text_file.readlines():
        # a one-element list -> one cell per row
        writer.writerow([row.strip()])
I want to write the rows of a csv file to another csv file. I want to change the content of each row as well in a way that if the row is empty, it remains empty and if it is not, any spaces at the beginning and end of the string are omitted. The original csv file has one column and 65422771 rows.
I have written the following to write the rows of the original csv file to the new one:
import csv

csvfile = open('data.csv', 'r')
with open('data 2.csv', "w+") as csv_file1:
    writer = csv.writer(csv_file1)
    count = 0
    for row in csvfile:
        row = row.replace('"', '')
        count += 1
        print(count)
        if row.strip() == '':
            writer.writerow('\n')
        else:
            writer.writerow(row)
However, when the new csv file is made, it is shown that it has 130845543 rows (= count)! The size of the new csv file is also 2 times the size of the original one. How can I create the new csv file with exactly the same number of rows but with the mentioned changes made to them?
Try this:
import csv

with open('data.csv', 'r') as file:
    # guard against empty rows so they stay empty instead of raising IndexError
    rows = [[row[0].strip()] if row else [] for row in csv.reader(file)]

with open('data_out.csv', "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(rows)
Also, as #tripleee mentioned, your file is quite large so you may want to read / write it in chunks. You can use pandas for that.
import pandas as pd

chunksize = 10_000
for chunk in pd.read_csv('data.csv', chunksize=chunksize, header=None):
    chunk[0] = chunk[0].str.strip()
    chunk.to_csv("data_out.csv", mode="a", header=False, index=False)
I have a txt file with 2 columns and many rows of integers and strings (without IDs), where I need to remove rows longer than 50 characters, for example:
4:33333333:3333333: -:aaaaaeeeeeeeffffffffhhhhhhhh
I guess pandas drop function is not suitable in this case (from description: Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names)
Does Python have any other options?
Thank you!
This will rewrite the file according to that rule. The trailing newline character \n is not counted.
with open('file.txt', 'r') as file:
    rows = [line for line in file.readlines() if len(line.strip()) <= 50]

with open('out.txt', 'w') as file:
    for row in rows:
        file.write(row)
If you need to use the rows for something else, do:
rows = [line.strip() for line in file.readlines() if len(line.strip()) <= 50]
to clean the strings.
I think that you could process a large text file by processing chunks (specifying the number of rows in a chunk) using pandas as follows:
import pandas as pd

if __name__ == '__main__':
    n_rows_per_process = 10 ** 6  # 1,000,000 rows in a chunk
    allowed_column_length = 50
    columns = ["feat1", "feat2"]
    input_path = "input.txt"
    output_path = "results.txt"

    with open(output_path, "a") as fout:
        for chunk_df in pd.read_csv(input_path, chunksize=n_rows_per_process, sep=r"\s+", names=columns):
            tmp_df = chunk_df[chunk_df["feat2"].str.len() < allowed_column_length]
            tmp_df.to_csv(fout, mode="a", header=False, sep="\t")
You will get the expected result in results.txt.
You may read your file line by line and write to (other) file only lines which fulfill your condition:
MAXLEN = 50

with open("input.txt") as inp, open("output.txt", "w") as outp:
    for line in inp:
        if len(line) <= MAXLEN + 1:
            outp.write(line)
Every read line includes the (invisible) ending \n symbol, too, so we compare its length with MAXLEN + 1.
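The off-by-one is easy to verify: iterating a text file yields each line with its trailing newline still attached.

```python
# Write a 3-character line and read it back.
with open('demo.txt', 'w') as f:
    f.write('abc\n')

with open('demo.txt') as f:
    line = next(f)

# len() counts the trailing '\n', hence the MAXLEN + 1 comparison.
assert len(line) == 4
assert len(line.rstrip('\n')) == 3
```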
I have a csv file which has many rows looks like below.
20170718 014418.475476 [UE:142 CRNTI : 446]
20170718 094937.865362 [UE:142 CRNTI : 546]
Above are the sample two rows of the csv file.
Now, looking at the rows, there is a string [UE:142 ...] that repeats throughout the csv file.
Problem statement:
I want to remove duplicate rows: whenever a [UE:<number> tag appears more than once in the csv file, only its first occurrence should be kept. In the rows above, [UE:142 appears twice, so the second row must be deleted. There are many such random tags like [UE:142.
Can anyone please help me with python script for the above problem statement?
reader = open("test.csv", "r")
lines = reader.read().split("\n")
reader.close()

writer = open("test_1.csv", "w")
# set() removes lines that are exact duplicates (order is not preserved)
for line in set(lines):
    writer.write(line + "\n")
writer.close()
from csv import reader, writer as csv_writer

csv_path = '<your csv file path here>'

def remove_duplicate_ue(csv_path):
    found = False
    with open(csv_path, 'r') as csv_file:
        for line in reader(csv_file, delimiter=' '):
            if 'UE:' not in line[-1]:
                yield line
            elif not found:
                yield line
                found = True

def write_csv(csv_path, rows, delimiter=' '):
    with open(csv_path, 'w') as csv_file:
        writer = csv_writer(csv_file, delimiter=delimiter)
        for row in rows:
            writer.writerow(row)

write_csv(csv_path, tuple(remove_duplicate_ue(csv_path)))
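Note that both snippets above keep at most the very first [UE: line in the whole file. If instead each distinct tag should be kept once, a set of already-seen tags does it; a sketch, assuming the tag is the whitespace-separated token that starts with '[UE:' (the helper name is made up):

```python
def dedupe_ue_lines(lines):
    """Keep the first line for each distinct '[UE:...' tag;
    lines without such a tag pass through untouched."""
    seen = set()
    for line in lines:
        tags = [tok for tok in line.split() if tok.startswith('[UE:')]
        if not tags:
            yield line
        elif tags[0] not in seen:
            seen.add(tags[0])
            yield line

sample = ['20170718 014418.475476 [UE:142 CRNTI : 446]',
          '20170718 094937.865362 [UE:142 CRNTI : 546]',
          '20170718 094937.865362 [UE:143 CRNTI : 546]']
# only the first [UE:142 line and the [UE:143 line survive
result = list(dedupe_ue_lines(sample))
```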
unique.txt contains 2 columns separated by a tab. total.txt contains 3 columns, each separated by a tab.
I take each row from unique.txt and look for it in total.txt. If it is present, I extract the entire row from total.txt and save it in a new output file.
###Total.txt
column a column b column c
interaction1 mitochondria_205000_225000 mitochondria_195000_215000
interaction2 mitochondria_345000_365000 mitochondria_335000_355000
interaction3 mitochondria_345000_365000 mitochondria_5000_25000
interaction4 chloroplast_115000_128207 chloroplast_35000_55000
interaction5 chloroplast_115000_128207 chloroplast_15000_35000
interaction15 2_10515000_10535000 2_10505000_10525000
###Unique.txt
column a column b
mitochondria_205000_225000 mitochondria_195000_215000
mitochondria_345000_365000 mitochondria_335000_355000
mitochondria_345000_365000 mitochondria_5000_25000
chloroplast_115000_128207 chloroplast_35000_55000
chloroplast_115000_128207 chloroplast_15000_35000
mitochondria_185000_205000 mitochondria_25000_45000
2_16595000_16615000 2_16585000_16605000
4_2785000_2805000 4_2775000_2795000
4_11395000_11415000 4_11385000_11405000
4_2875000_2895000 4_2865000_2885000
4_13745000_13765000 4_13735000_13755000
My program:
file = open('total.txt')
file2 = open('unique.txt')
all_content = file.readlines()
all_content2 = file2.readlines()
store_id_lines = []
ff = open('match.dat', 'w')
for i in range(len(all_content)):
    line = all_content[i].split('\t')
    seq = line[1] + '\t' + line[2]
    for j in range(len(all_content2)):
        if all_content2[j] == seq:
            ff.write(seq)
            break
Problem:
but instead of giving the desired output (the values of the first column that fulfil the if condition), I need something like: if the j-th row of unique.txt equals columns b and c of the i-th row of total.txt, then write the i-th row of total.txt into a new file.
import csv

with open('unique.txt') as uniques, open('total.txt') as total:
    # both files are tab-separated, so pass delimiter='\t'
    uniques = list(tuple(line) for line in csv.reader(uniques, delimiter='\t'))
    totals = {}
    for line in csv.reader(total, delimiter='\t'):
        totals[tuple(line[1:])] = line

with open('output.txt', 'w') as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    for line in uniques:
        writer.writerow(totals.get(line, []))
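The same lookup can also be expressed as a pandas inner merge on the two shared columns; a sketch on tiny inline frames (the column names and shortened values are made up for illustration):

```python
import pandas as pd

total = pd.DataFrame({'a': ['interaction1', 'interaction2'],
                      'b': ['m_205000', 'm_345000'],
                      'c': ['m_195000', 'm_335000']})
unique = pd.DataFrame({'b': ['m_205000'], 'c': ['m_195000']})

# inner merge keeps only the total rows whose (b, c) pair occurs in unique
match = total.merge(unique, on=['b', 'c'], how='inner')
match.to_csv('match.dat', sep='\t', index=False, header=False)
```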
I would write your code this way:
file = open('total.txt')
list_file = list(file)
file2 = open('unique.txt')
list_file2 = list(file2)
store_id_lines = []
ff = open('match.dat', 'w')
for curr_line_total in list_file:
    line = curr_line_total.split('\t')
    seq = line[1] + '\t' + line[2]
    if seq in list_file2:
        ff.write(curr_line_total)
Please avoid readlines() and use the with statement when you open your files: iterating the file object directly yields the lines one at a time without loading the whole file into memory, which is why readlines() is rarely needed.
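Putting both points together: iterate the file objects directly and keep the unique.txt lines in a set, so each membership test is O(1) instead of a linear scan of the list. A self-contained sketch with tiny made-up sample files:

```python
# Tiny tab-separated sample files for illustration.
with open('total.txt', 'w') as f:
    f.write('interaction1\tm_205000\tm_195000\n'
            'interaction2\tm_345000\tm_335000\n')
with open('unique.txt', 'w') as f:
    f.write('m_205000\tm_195000\n')

# Build the lookup set once; each element keeps its trailing '\n'.
with open('unique.txt') as f:
    wanted = set(f)

with open('total.txt') as f, open('match.dat', 'w') as out:
    for line in f:
        cols = line.split('\t')
        # cols[2] still ends with '\n', matching the raw unique.txt lines.
        seq = cols[1] + '\t' + cols[2]
        if seq in wanted:
            out.write(line)
```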