I have a txt file with 2 columns and many rows of integers and strings (without IDs), where I need to remove rows longer than 50 characters, for example:
4:33333333:3333333: -:aaaaaeeeeeeeffffffffhhhhhhhh
I guess the pandas drop function is not suitable in this case (from its description: "Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names").
Does Python have any other options?
Thank you!
This will rewrite the file according to that rule. The newline character \n is not counted here, because each line is stripped before its length is measured.
with open('file.txt', 'r') as file:
    rows = [line for line in file.readlines() if len(line.strip()) <= 50]

with open('out.txt', 'w') as file:
    for row in rows:
        file.write(row)
If you need to use the rows for something else, do:
rows = [line.strip() for line in file.readlines() if len(line.strip()) <= 50]
to clean the strings.
I think you could process a large text file in chunks (specifying the number of rows per chunk) using pandas as follows:
import pandas as pd

if __name__ == '__main__':
    n_rows_per_process = 10 ** 6  # 1,000,000 rows per chunk
    allowed_column_length = 50
    columns = ["feat1", "feat2"]
    input_path = "input.txt"
    output_path = "results.txt"
    with open(output_path, "a") as fout:
        for chunk_df in pd.read_csv(input_path, chunksize=n_rows_per_process, sep=r"\s+", names=columns):
            tmp_df = chunk_df[chunk_df["feat2"].str.len() < allowed_column_length]
            tmp_df.to_csv(fout, header=False, sep="\t")
You will get the expected result in results.txt.
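Note that this filters on the length of the second column only. If the 50-character limit should apply to the whole row, as in the question, a hedged variant (assuming the same whitespace-separated two-column layout and a single-space separator) could combine both columns:

import pandas as pd

for chunk_df in pd.read_csv("input.txt", chunksize=10 ** 6, sep=r"\s+", names=["feat1", "feat2"]):
    # total printed length: both fields plus one separating space
    row_len = chunk_df["feat1"].astype(str).str.len() + chunk_df["feat2"].astype(str).str.len() + 1
    chunk_df[row_len <= 50].to_csv("results.txt", mode="a", header=False, sep="\t", index=False)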
You may read your file line by line and write to another file only the lines which fulfill your condition:
MAXLEN = 50

with open("input.txt") as inp, open("output.txt", "w") as outp:
    for line in inp:
        if len(line) <= MAXLEN + 1:
            outp.write(line)
Every line read includes the (invisible) trailing \n symbol, too, so we compare its length with MAXLEN + 1.
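If you would rather not rely on every line ending with \n (the last line of a file may not), a small variant of the same loop strips the newline before measuring:

MAXLEN = 50

with open("input.txt") as inp, open("output.txt", "w") as outp:
    for line in inp:
        if len(line.rstrip("\n")) <= MAXLEN:
            outp.write(line)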
I am looking to remove rows from a csv file if they contain specific strings in their row.
I'd like to be able to create a new output file versus overwriting the original.
I need to remove any rows that contain "py-board" or "coffee"
Example Input:
173.20.1.1,2-base
174.28.2.2,2-game
174.27.3.109,xyz-b13-coffee-2
174.28.32.8,2-play
175.31.4.4,xyz-102-o1-py-board
176.32.3.129,xyz-b2-coffee-1
177.18.2.8,six-jump-walk
Expected Output:
173.20.1.1,2-base
174.28.2.2,2-game
174.28.32.8,2-play
177.18.2.8,six-jump-walk
I tried this
Deleting rows with Python in a CSV file
import csv

with open('input_csv_file.csv', 'rb') as inp, open('purged_csv_file', 'wb') as out:
    writer = csv.writer(out)
    for row in csv.reader(inp):
        if row[1] != "py-board" or if row[1] != "coffee":
            writer.writerow(row)
and I tried this
import csv

with open('input_csv_file.csv', 'rb') as inp, open('purged_csv_file', 'wb') as out:
    writer = csv.writer(out)
    for row in csv.reader(inp):
        if row[1] != "py-board":
            if row[1] != "coffee":
                writer.writerow(row)
and this
if row[1][-8:] != "py-board":
    if row[1][-8:] != "coffee-1":
        if row[1][-8:] != "coffee-2":
but got this error
File "C:\testing\syslogyamlclean.py", line 6, in <module>
for row in csv.reader(inp):
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
I would actually not use the csv package for this goal. This can be achieved easily using standard file reading and writing.
Try this code (I have written some comments to make it self-explanatory):
# We open the source file and get its lines
with open('input_csv_file.csv', 'r') as inp:
    lines = inp.readlines()

# We open the target file in write mode
with open('purged_csv_file.csv', 'w') as out:
    # We go line by line, writing to the target file
    # only if the original line does not include the
    # strings 'py-board' or 'coffee'
    for line in lines:
        if 'py-board' not in line and 'coffee' not in line:
            out.write(line)
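If you do want to stay with the csv module, note that the traceback in the question comes from opening the files in binary mode ('rb'/'wb'), which Python 3's csv reader rejects. A sketch of a corrected version (text mode with newline='', and substring checks instead of exact comparison, since the target strings appear inside longer values) might look like:

import csv

with open('input_csv_file.csv', 'r', newline='') as inp, open('purged_csv_file.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for row in csv.reader(inp):
        # keep the row only if the second column contains neither string
        if 'py-board' not in row[1] and 'coffee' not in row[1]:
            writer.writerow(row)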
# pandas helps to read and manipulate .csv file
import pandas as pd
# read .csv file
df = pd.read_csv('input_csv_file.csv', sep=',', header=None)
df
0 1
0 173.20.1.1 2-base
1 174.28.2.2 2-game
2 174.27.3.109 xyz-b13-coffee-2
3 174.28.32.8 2-play
4 175.31.4.4 xyz-102-o1-py-board
5 176.32.3.129 xyz-b2-coffee-1
6 177.18.2.8 six-jump-walk
# filter rows: keep those whose second column contains neither string
result = df[~(df[1].str.contains('py-board') | df[1].str.contains('coffee'))]
print(result)
0 1
0 173.20.1.1 2-base
1 174.28.2.2 2-game
3 174.28.32.8 2-play
6 177.18.2.8 six-jump-walk
# save to result.csv file
result.to_csv('result.csv', index=False, header=False)
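Since str.contains interprets its pattern as a regular expression by default, the two checks can also be collapsed into a single alternation (same DataFrame as above):

result = df[~df[1].str.contains('py-board|coffee')]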
I have a text file that contains a sentence in each line. Some lines are also empty.
sentence 1
sentence 2
empty line
I want to write the content of this file in a csv file in a way that the csv file has only one column and in each row the corresponding sentence is written. This is what I have tried:
import csv

f = open('data 2.csv', 'w')
with f:
    writer = csv.writer(f)
    for row in open('data.txt', 'r'):
        writer.writerow(row)
import pandas as pd
df = pd.read_csv('data 2.csv')
Supposing that I have three sentences in my text file, I want a csv file to have one column with 3 rows. However, when I run the code above, I will get the output below:
[1 rows x 55 columns]
It seems that each character in the sentences is written in one cell and all sentences are written in one row. How should I fix this problem?
So you want to load a text file into a single column of a dataframe, one line per dataframe row. It can be done directly:
import pandas as pd

with open('data.txt') as file:
    df = pd.DataFrame((line.strip() for line in file), columns=['text'])
You can even filter empty lines at read time with filter:
with open('data.txt') as file:
    df = pd.DataFrame(filter(lambda x: len(x) > 0, (line.strip() for line in file)),
                      columns=['text'])
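To then produce the one-column csv file the question asks for, you could write the DataFrame back out (filename taken from the question):

df.to_csv('data 2.csv', index=False, header=False)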
In your code, writer.writerow(row) receives a whole line as a string, and csv treats a string as a sequence of characters, so each character lands in its own cell. Wrap each line in a list so the whole sentence becomes a single field:
import csv

f = open('data 2.csv', 'w')
with f:
    writer = csv.writer(f)
    text_file = open('data.txt', 'r')
    for row in text_file.readlines():
        writer.writerow([row.strip()])
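Reading the result back should now show one column with one sentence per row (header=None because the file has no header row):

import pandas as pd

df = pd.read_csv('data 2.csv', header=None)
print(df.shape)  # e.g. (3, 1) for three sentences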
I have a CSV file with, let's say, 16000 rows. I need to split it up in two separate files, but also need an overlap in the files of about 360 rows, so row 1-8360 in one file and row 8000-16000 in the other. Or 1-8000 and 7640-16000.
CSV file look like this:
Value X Y Z
4.5234 -46.29753186 -440.4915915 -6291.285393
4.5261 -30.89639381 -441.8390165 -6291.285393
4.5289 -15.45761327 -442.6481287 -6291.285393
4.5318 0 -442.9179423 -6291.285393
I have used this code in Python 3 to split the file, but I'm unable to get the overlap I want:
with open('myfile.csv', 'r') as f:
    csvfile = f.readlines()

linesPerFile = 8000
filename = 1
for i in range(0, len(csvfile), linesPerFile):
    with open(str(filename) + '.csv', 'w+') as f:
        if filename > 1:  # this is the second or later file, we need to write the
            f.write(csvfile[0])  # header again if 2nd... file
        f.writelines(csvfile[i:i + linesPerFile])
    filename += 1
And tried to modify it like this:
for i in range(0,len(csvfile),linesPerFile+360):
and
f.writelines(csvfile[360-i:i+linesPerFile])
but I haven't been able to make it work.
It's very easy with pandas read_csv and iloc.
import pandas as pd
import numpy as np

# df = pd.read_csv('source_file.csv')
df = pd.DataFrame(data=np.random.randn(16000, 5))

df.iloc[:8360].to_csv('file_1.csv')
df.iloc[8000:].to_csv('file_2.csv')
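A hedged generalization, if you want the split point and the overlap as explicit parameters (the names here are mine):

split_at = 8000  # first row of the second file
overlap = 360    # rows shared by both files

df.iloc[:split_at + overlap].to_csv('file_1.csv')
df.iloc[split_at:].to_csv('file_2.csv')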
Hopefully you have already got a more elegant answer using pandas. You could consider the below if you don't want to install extra modules.
def write_files(input_file, file1, file2, file1_end_line_no, file2_start_line_no):
    # Open all 3 file handles
    with open(input_file) as csv_in, open(file1, 'w') as ff, open(file2, 'w') as sf:
        # Process the header
        header = next(csv_in)
        header = ','.join(header.split())
        ff.write(header + '\n')
        sf.write(header + '\n')
        for index, line in enumerate(csv_in):
            line_content = ','.join(line.split())  # 4.5234 -46.29753186 -440.4915915 -6291.285393 => 4.5234,-46.29753186,-440.4915915,-6291.285393
            if index <= file1_end_line_no:  # index is at or before the first file's last row
                ff.write(line_content + '\n')
            if index >= file2_start_line_no:  # index is at or after the second file's first row
                sf.write(line_content + '\n')
Sample Run:
if __name__ == '__main__':
    in_file = 'csvfile.csv'
    write_files(
        in_file,
        '1.txt',
        '2.txt',
        2,
        2
    )
What about this?
for i in range(0, len(csvfile), linesPerFile):
    init = i
    with open(str(filename) + '.csv', 'w+') as f:
        if filename > 1:  # this is the second or later file, we need to write the
            f.write(csvfile[0])  # header again if 2nd... file
            init = i - 360
        f.writelines(csvfile[init:i + linesPerFile + 1])
    filename += 1
Is this what you are looking for? Please upload a test file if it doesn't work, so we can provide a better answer :-)
I'm working with a .csv file that lists Timestamps in one column and Wind Speeds in the second column. I need to read through this .csv file and calculate the percent of time where wind speed was above 2m/s. Here's what I have so far.
txtFile = r"C:\Data.csv"
line = o_txtFile.readline()[:-1]
while line:
    line = oTextfile.readline()
for line in txtFile:
    line = line.split(",")[:-1]
How do I get a count of the lines where the 2nd element in the line is greater than 2?
CSV File Sample
You will probably have to update your CSV slightly, depending on the chosen option (for option 1 and option 2, you will definitely want to remove all header rows, whereas for option 3, you will keep only the middle one, i.e. the one that starts with TIMESTAMP).
You actually have three options:
Option 1: Vanilla Python
count = 0
with open('data.csv', 'r') as file:
    for line in file:
        value = float(line.split(',')[1])
        if value > 2:
            count += 1
# Now you have the value in the ``count`` variable
Option 2: CSV module
Here I use Python's csv module (you could just as well use DictReader, but I'll let you look that up yourself).
import csv

count = 0
with open('data.csv', 'r') as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        if float(row[1]) > 2:
            count += 1
# Now you have the value in the ``count`` variable
Option 3: Pandas
Pandas is a really cool, awesome library used by a lot of people to do data analysis. Doing what you want to do would look like:
import pandas as pd

df = pd.read_csv('data.csv')
# Here you are
count = len(df[df['WindSpd_ms'] > 2])
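The question actually asks for the percent of time, not just the count; assuming the same column name, the mean of the boolean mask gives it directly:

percent = (df['WindSpd_ms'] > 2).mean() * 100
print(f"{percent:.1f}% of the time the wind speed was above 2 m/s")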
You can read the file in line by line and, if there is something in the line, split it.
You count the lines read and how many are above 10 m/s, then calculate the percentage:
# create a data file for processing with random data
import random

random.seed(42)

with open("data.txt", "w") as f:
    f.write("header\n")
    f.write("header\n")
    f.write("header\n")
    f.write("header\n")
    for sp in random.choices(range(10), k=200):
        f.write(f"some date,{sp+3.5}, data,data,data\n")
# open/read/calculate the percentage of datapoints with speeds above 10 m/s
days = 0
speedGreater10 = 0
with open("data.txt", "r") as f:
    for _ in range(4):
        next(f)  # ignore the first 4 rows containing headers
    for line in f:
        if line:  # not empty
            _, speed, *p = line.split(",")
            # _ and *p are ignored (they take 'some date' + [data,data,data])
            days += 1
            if float(speed) > 10:
                speedGreater10 += 1

print(f"{days} datapoints, of which {speedGreater10} "
      f"got more than 10m/s: {speedGreater10/days:.1%}")
Output:
200 datapoints, of which 55 got more than 10m/s: 27.5%
Datafile:
header
header
header
header
some date,9.5, data,data,data
some date,3.5, data,data,data
some date,5.5, data,data,data
some date,5.5, data,data,data
some date,10.5, data,data,data
[... some more ...]
some date,8.5, data,data,data
some date,3.5, data,data,data
some date,12.5, data,data,data
some date,11.5, data,data,data
The unique.txt file contains 2 columns separated by tabs. The total.txt file contains 3 columns, each separated by tabs.
I take each row from the unique.txt file and look for it in the total.txt file. If it is present, I extract the entire row from total.txt and save it in a new output file.
###Total.txt
column a column b column c
interaction1 mitochondria_205000_225000 mitochondria_195000_215000
interaction2 mitochondria_345000_365000 mitochondria_335000_355000
interaction3 mitochondria_345000_365000 mitochondria_5000_25000
interaction4 chloroplast_115000_128207 chloroplast_35000_55000
interaction5 chloroplast_115000_128207 chloroplast_15000_35000
interaction15 2_10515000_10535000 2_10505000_10525000
###Unique.txt
column a column b
mitochondria_205000_225000 mitochondria_195000_215000
mitochondria_345000_365000 mitochondria_335000_355000
mitochondria_345000_365000 mitochondria_5000_25000
chloroplast_115000_128207 chloroplast_35000_55000
chloroplast_115000_128207 chloroplast_15000_35000
mitochondria_185000_205000 mitochondria_25000_45000
2_16595000_16615000 2_16585000_16605000
4_2785000_2805000 4_2775000_2795000
4_11395000_11415000 4_11385000_11405000
4_2875000_2895000 4_2865000_2885000
4_13745000_13765000 4_13735000_13755000
My program:
file = open('total.txt')
file2 = open('unique.txt')
all_content = file.readlines()
all_content2 = file2.readlines()
store_id_lines = []
ff = open('match.dat', 'w')
for i in range(len(all_content)):
    line = all_content[i].split('\t')
    seq = line[1] + '\t' + line[2]
    for j in range(len(all_content2)):
        if all_content2[j] == seq:
            ff.write(seq)
            break
Problem:
Instead of giving the desired output (the first-column values of the rows that fulfill the if condition), I need something like: if the jth row of unique.txt equals columns b and c of the ith row of total.txt, then write the ith row of total.txt into the new file.
import csv

with open('unique.txt') as uniques, open('total.txt') as total:
    uniques = list(tuple(line) for line in csv.reader(uniques, delimiter='\t'))
    totals = {}
    for line in csv.reader(total, delimiter='\t'):
        totals[tuple(line[1:])] = line

with open('output.txt', 'w') as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    for line in uniques:
        writer.writerow(totals.get(line, []))
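Note that totals.get(line, []) writes an empty row for every unique.txt entry with no match in total.txt; if you only want the matching rows in the output, guard the write with if line in totals: instead.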
I would write your code in this way:
with open('total.txt') as file, open('unique.txt') as file2, open('match.dat', 'w') as ff:
    list_file2 = list(file2)
    for curr_line_total in file:
        line = curr_line_total.split('\t')
        seq = line[1] + '\t' + line[2]
        if seq in list_file2:
            ff.write(curr_line_total)
Please avoid readlines() and use the with syntax when you open your files. Here is explained why you don't need to use readlines().
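One more hedged note: seq in list_file2 scans the whole list for every row of total.txt. For large files, putting the unique.txt lines in a set makes each lookup constant-time (same files and line format assumed):

with open('total.txt') as file, open('unique.txt') as file2, open('match.dat', 'w') as ff:
    unique_pairs = set(file2)  # each entry is 'column_b\tcolumn_c\n'
    for curr_line_total in file:
        line = curr_line_total.split('\t')
        seq = line[1] + '\t' + line[2]
        if seq in unique_pairs:
            ff.write(curr_line_total)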