Looking to break up and check individual cells from a CSV file that was pulled from Excel with Python 3.8. For example, I have a CSV file with the information Honda 1, Toyota 2, Nissan 3... I want to check each cell (not sure what to call the data before the comma delimiter) for an integer and then I want to remove it but also put it in its own cell. So the CSV would then read Honda, 1, Toyota, 2, Nissan, 3... The main goal would be to get those integers in a column next to the manufacturers in Excel.
I am pretty new to Python but have some coding background. The logic I had in mind is roughly: if a character is an integer, add it to the new file, otherwise add N/A. My main problem is working with the data in a CSV file. I thought about reading the data from the CSV into a variable, but the real file has over 20,000 cells, so I'm not sure that would be efficient.
So far my code looks like this:
import csv
path = '/Users/testFolder/Test.csv'
new_path = '/Users/testFolder/Test2.csv'
test_file = open(path,'r')
data = test_file.read()
write_file = open(new_path,'w')
write_file.write(data)
print(data)
file = csv.reader(open(path), delimiter=',')
for line in file:
    print(line)
test_file.close()
write_file.close()
Assuming the parts of each item are separated by one or more spaces, you can do it a row-at-time (instead of reading the whole file into memory) like this:
import csv
path = 'remove_test.csv'
new_path = 'remove_test2.csv'
with open(path, 'r', newline='') as test_file, \
     open(new_path, 'w', newline='') as write_file:
    reader = csv.reader(test_file, delimiter=',')
    writer = csv.writer(write_file, delimiter=',')
    for row in reader:
        new_row = [part for item in row for part in item.split()]
        writer.writerow(new_row)
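The question also mentions writing N/A when an item has no number. Here is a minimal sketch of that variant, assuming each number is a trailing run of digits separated from the name by whitespace. The helper names split_trailing_int and split_row and the 'N/A' placeholder are illustrative and not part of the answer above:

```python
import re

# Matches "<name> <digits>" where the digits end the cell (an assumption about the data).
TRAILING_INT = re.compile(r'^(.*?)\s+(\d+)$')

def split_trailing_int(item, missing='N/A'):
    """Return [name, number] for one cell, using `missing` when no number is found."""
    text = item.strip()
    match = TRAILING_INT.match(text)
    if match:
        return [match.group(1), match.group(2)]
    return [text, missing]

def split_row(row):
    """Flatten one CSV row into name, number, name, number, ..."""
    return [part for item in row for part in split_trailing_int(item)]
```

Inside the loop, writer.writerow(split_row(row)) would then replace the list comprehension. Unlike item.split(), this keeps multi-word names such as "Land Rover 12" together as one name cell.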
I have a huge CSV file and I tried to filter its data using with open.
I know I can use FINDSTR on the command line, but I would like to use Python to create a new, filtered file, or to create a pandas DataFrame as output.
Here is my code:
outfile = open('my_file2.csv', 'a')
with open('my_file1.csv', 'r') as f:
    for lines in f:
        if '31/10/2018' in lines:
            print(lines)
        outfile.write(lines)
The problem is that the generated output file is identical to the input file: nothing is filtered out, and the file size is the same.
Thanks to all
The problem with your code is the indentation of the last line. It should be within the if-statement, so only lines that contain '31/10/2018' get written.
outfile = open('my_file2.csv', 'a')
with open('my_file1.csv', 'r') as f:
    for lines in f:
        if '31/10/2018' in lines:
            print(lines)
            outfile.write(lines)
To filter using pandas and create a DataFrame, do something along the lines of:
import pandas as pd
import datetime
# I assume here that the date is in a separate column, named 'Date'
# dayfirst=True because the dates are written as day/month/year
df = pd.read_csv('my_file1.csv', parse_dates=['Date'], dayfirst=True)
# Filter on October 31st 2018
df_filter = df[df['Date'].dt.date == datetime.date(2018, 10, 31)]
# Output to csv
df_filter.to_csv('my_file2.csv', index=False)
(For very large CSV files, look at the chunksize argument of pd.read_csv(), which reads the file in pieces.)
To use with open(....) as f:, you could do something like:
import pandas as pd
filtered_list = []
with open('my_file1.csv', 'r') as f:
    for lines in f:
        if '31/10/2018' in lines:
            print(lines)
            # Strip the trailing newline, then split the line by comma into a list
            line_data = lines.strip().split(',')
            filtered_list.append(line_data)
# Convert to a DataFrame and export as CSV
df = pd.DataFrame(filtered_list)
# header=False because the DataFrame has no real column names
df.to_csv('my_file2.csv', index=False, header=False)
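If pandas is not needed, the same filter can be written with the csv module so that only one row is held in memory at a time. This is a sketch under assumptions: the function name filter_rows, the date_column index, and the presence of a header row are all hypothetical and would need adjusting to the real file.

```python
import csv

def filter_rows(src_path, dst_path, wanted, date_column=0, has_header=True):
    """Copy rows whose date column starts with `wanted`; return the number of rows kept."""
    kept = 0
    with open(src_path, 'r', newline='') as src, \
         open(dst_path, 'w', newline='') as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        if has_header:
            # Copy the header row through unchanged, if the file is not empty
            header = next(reader, None)
            if header is not None:
                writer.writerow(header)
        for row in reader:
            # Skip blank or short rows, then compare only the date column
            if len(row) > date_column and row[date_column].startswith(wanted):
                writer.writerow(row)
                kept += 1
    return kept
```

A call such as filter_rows('my_file1.csv', 'my_file2.csv', '31/10/2018') matches the date only in its own column instead of anywhere in the line. The output is opened with 'w', so each run starts a fresh file, whereas 'a' appends to whatever an earlier run wrote.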
I have a text file that contains a sentence in each line. Some lines are also empty.
sentence 1
sentence 2
empty line
I want to write the content of this file in a csv file in a way that the csv file has only one column and in each row the corresponding sentence is written. This is what I have tried:
import csv
f = open('data 2.csv', 'w')
with f:
    writer = csv.writer(f)
    for row in open('data.txt', 'r'):
        writer.writerow(row)
import pandas as pd
df = pd.read_csv('data 2.csv')
Supposing that I have three sentences in my text file, I want a csv file to have one column with 3 rows. However, when I run the code above, I will get the output below:
[1 rows x 55 columns]
It seems that each character in the sentences is written in one cell and all sentences are written in one row. How should I fix this problem?
So you want to load a text file into a single column of a dataframe, one line per dataframe row. It can be done directly:
import pandas as pd
with open('data.txt') as file:
    df = pd.DataFrame((line.strip() for line in file), columns=['text'])
You can even filter empty lines at read time with filter:
with open('data.txt') as file:
    df = pd.DataFrame(filter(lambda x: len(x) > 0, (line.strip() for line in file)),
                      columns=['text'])
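In your code, writerow() receives a string, and because a string is itself a sequence, each character is written to its own cell. Read the file line by line and wrap each line in a list so that it is written as a single field:
import csv
f = open('data 2.csv', 'w', newline='')
with f:
    writer = csv.writer(f)
    text_file = open('data.txt', 'r')
    for row in text_file.readlines():
        # Strip the newline and pass a one-element list: one sentence per row
        writer.writerow([row.strip()])
    text_file.close()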
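For the csv-only route, here is a minimal sketch that writes each non-empty line as a single-field row. The function name lines_to_csv and the decision to skip blank lines are assumptions made for illustration.

```python
import csv

def lines_to_csv(txt_path, csv_path):
    """Write every non-empty line of txt_path as a one-column row; return the row count."""
    count = 0
    with open(txt_path, 'r') as src, open(csv_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for line in src:
            sentence = line.strip()
            if sentence:
                # writerow expects a sequence of fields, so wrap the string in a list
                writer.writerow([sentence])
                count += 1
    return count
```

Sentences that contain commas are quoted by csv.writer, so they still occupy one cell. When reading the result back with pd.read_csv, pass header=None so that the first sentence is not taken as the column name.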
I have data, for example:
2017/06/07 10:42:35,THREAT,url,192.168.1.100,52.25.xxx.xxx,Rule-VWIRE-03,13423523,,web-browsing,80,tcp,block-url
2017/06/07 10:43:35,THREAT,url,192.168.1.101,52.25.xxx.xxx,Rule-VWIRE-03,13423047,,web-browsing,80,tcp,allow
2017/06/07 10:43:36,THREAT,end,192.168.1.100,52.25.xxx.xxx,Rule-VWIRE-03,13423047,,web-browsing,80,tcp,block-url
2017/06/07 10:44:09,TRAFFIC,end,192.168.1.101,52.25.xxx.xxx,Rule-VWIRE-03,13423111,,web-browsing,80,tcp,allow
2017/06/07 10:44:09,TRAFFIC,end,192.168.1.103,52.25.xxx.xxx,Rule-VWIRE-03,13423111,,web-browsing,80,tcp,block-url
How can I parse this so that I only get columns 4, 5, 7, and 12 from every row?
This is my code:
import csv
file=open('filename.log', 'r')
f=open('fileoutput', 'w')
lines = file.readlines()
for line in lines:
    result.append(line.split(' ')[4,5,7,12])
    f.write (line)
f.close()
file.close()
The right way with csv.reader and csv.writer objects:
import csv
with open('filename.log', 'r') as fr, open('filoutput.csv', 'w', newline='') as fw:
    reader = csv.reader(fr)
    writer = csv.writer(fw)
    for l in reader:
        writer.writerow(v for k,v in enumerate(l, 1) if k in (4,5,7,12))
filoutput.csv contents:
192.168.1.100,52.25.xxx.xxx,13423523,block-url
192.168.1.101,52.25.xxx.xxx,13423047,allow
192.168.1.100,52.25.xxx.xxx,13423047,block-url
192.168.1.101,52.25.xxx.xxx,13423111,allow
192.168.1.103,52.25.xxx.xxx,13423111,block-url
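This is wrong, because a list cannot be indexed with several positions at once:
line.split(' ')[4,5,7,12]
You want this (the data is comma-separated, and list indices start at 0):
fields = line.split(',')
fields[3], fields[4], fields[6], fields[11]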
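The difference can be checked on one of the sample lines. This sketch assumes comma-separated input and 1-based column numbers as in the question; pick_columns is a hypothetical helper name.

```python
def pick_columns(line, positions=(4, 5, 7, 12)):
    """Return the fields at the given 1-based positions of a comma-separated line."""
    fields = line.rstrip('\n').split(',')
    # Lists are zero-indexed, so column 4 is fields[3]
    return [fields[p - 1] for p in positions]

sample = ('2017/06/07 10:42:35,THREAT,url,192.168.1.100,52.25.xxx.xxx,'
          'Rule-VWIRE-03,13423523,,web-browsing,80,tcp,block-url\n')
selected = pick_columns(sample)
# selected == ['192.168.1.100', '52.25.xxx.xxx', '13423523', 'block-url']
```

Writing sample.split(',')[4, 5, 7, 12] instead raises a TypeError, because the comma-separated positions form a tuple and list indices must be integers or slices.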
A solution using pandas:
import pandas as pd
df = pd.read_csv('filename.log', sep=',', header=None, index_col=False)
df[[3, 4, 6, 11]].to_csv('fileoutput.csv', header=False, index=False)
Note the use of [3, 4, 6, 11] instead of [4, 5, 7, 12] to account for 0-indexing in the dataframe's columns.
Content of fileoutput.csv:
192.168.1.100,52.25.xxx.xxx,13423523,block-url
192.168.1.101,52.25.xxx.xxx,13423047,allow
192.168.1.100,52.25.xxx.xxx,13423047,block-url
192.168.1.101,52.25.xxx.xxx,13423111,allow
192.168.1.103,52.25.xxx.xxx,13423111,block-url
You're on the right path, but your syntax is off. Here's an example using the csv module:
import csv
log = open('filename.log')
# newline='' prevents csv.writer from adding extra blank lines when writing to the file
log_write = open('fileoutput', 'w', newline='')
csv_log = csv.reader(log, delimiter=',')
csv_writer = csv.writer(log_write, delimiter=',')
for line in csv_log:
    csv_writer.writerow([line[0], line[1], line[2], line[3]])  # output first 4 columns
log.close()
log_write.close()
Using a generator expression (similar to a list comprehension), you could do something like this without necessarily using the csv module:
file=open('filename.log','r')
f=open('fileoutput', 'w')
lines = file.readlines()
for line in lines:
    f.write(','.join(line.split(',')[i] for i in [3,4,6,11]))
f.close()
file.close()
Notice that the indices are 3, 4, 6, 11 because the list is zero-indexed.
Output:
cat fileoutput
192.168.1.100,52.25.xxx.xxx,13423523,block-url
192.168.1.101,52.25.xxx.xxx,13423047,allow
192.168.1.100,52.25.xxx.xxx,13423047,block-url
192.168.1.101,52.25.xxx.xxx,13423111,allow
192.168.1.103,52.25.xxx.xxx,13423111,block-url
The file unique.txt contains 2 tab-separated columns. The file total.txt contains 3 tab-separated columns.
I take each row from unique.txt and look for it in total.txt. If it is present, I want to extract the entire row from total.txt and save it in a new output file.
###Total.txt
column a column b column c
interaction1 mitochondria_205000_225000 mitochondria_195000_215000
interaction2 mitochondria_345000_365000 mitochondria_335000_355000
interaction3 mitochondria_345000_365000 mitochondria_5000_25000
interaction4 chloroplast_115000_128207 chloroplast_35000_55000
interaction5 chloroplast_115000_128207 chloroplast_15000_35000
interaction15 2_10515000_10535000 2_10505000_10525000
###Unique.txt
column a column b
mitochondria_205000_225000 mitochondria_195000_215000
mitochondria_345000_365000 mitochondria_335000_355000
mitochondria_345000_365000 mitochondria_5000_25000
chloroplast_115000_128207 chloroplast_35000_55000
chloroplast_115000_128207 chloroplast_15000_35000
mitochondria_185000_205000 mitochondria_25000_45000
2_16595000_16615000 2_16585000_16605000
4_2785000_2805000 4_2775000_2795000
4_11395000_11415000 4_11385000_11405000
4_2875000_2895000 4_2865000_2885000
4_13745000_13765000 4_13735000_13755000
My program:
file=open('total.txt')
file2 = open('unique.txt')
all_content=file.readlines()
all_content2=file2.readlines()
store_id_lines = []
ff = open('match.dat', 'w')
for i in range(len(all_content)):
    line=all_content[i].split('\t')
    seq=line[1]+'\t'+line[2]
    for j in range(len(all_content2)):
        if all_content2[j]==seq:
            ff.write(seq)
            break
Problem:
But it does not give the desired output (the values of the 1st column for the rows that fulfil the if condition). I need something like: if the jth row of unique.txt equals the ith row of total.txt, then write the whole ith row of total.txt into the new file.
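import csv
with open('unique.txt', newline='') as uniques, open('total.txt', newline='') as total:
    # Both files are tab-separated, so set the delimiter explicitly
    uniques = [tuple(line) for line in csv.reader(uniques, delimiter='\t')]
    # Map (column b, column c) of total.txt to the full row
    totals = {}
    for line in csv.reader(total, delimiter='\t'):
        totals[tuple(line[1:])] = line
with open('output.txt', 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    for line in uniques:
        # Write the full row only when the pair exists in total.txt
        if line in totals:
            writer.writerow(totals[line])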
I would write your code this way:
with open('total.txt') as file, open('unique.txt') as file2:
    list_file = list(file)
    # Strip newlines so the last line of unique.txt is compared correctly too
    list_file2 = [line.rstrip('\n') for line in file2]
with open('match.dat', 'w') as ff:
    for curr_line_total in list_file:
        line = curr_line_total.rstrip('\n').split('\t')
        seq = line[1] + '\t' + line[2]
        if seq in list_file2:
            ff.write(curr_line_total)
Please avoid readlines() and use the with statement when you open your files.
You don't need readlines() because a file object can be iterated over line by line directly.
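For larger files, the list lookup can be replaced by a set so that each membership test does not rescan unique.txt. This is only a sketch: matching_rows is a hypothetical name, and it assumes both files are tab-separated, with the pair to match in the second and third columns of total.txt.

```python
def matching_rows(total_path, unique_path):
    """Yield lines of total_path whose 2nd and 3rd columns form a pair listed in unique_path."""
    with open(unique_path) as uniques:
        # Set of (first, second) column tuples from unique.txt for fast lookups
        wanted = {tuple(line.rstrip('\n').split('\t')) for line in uniques if line.strip()}
    with open(total_path) as total:
        for line in total:
            cols = line.rstrip('\n').split('\t')
            # Ignore blank or short lines instead of failing on them
            if len(cols) >= 3 and (cols[1], cols[2]) in wanted:
                yield line
```

The matches can then be written out in one call, for example ff.writelines(matching_rows('total.txt', 'unique.txt')) inside a with open('match.dat', 'w') as ff: block.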