I currently have one text file (tab-delimited) with 15,000 records.
I have another text file (tab-delimited) with 5,000 records.
In the file with 5,000 records there are some rows that match rows in the file containing 15,000 records. These are identifiable by a column header named URN (unique record number). For example, I may need URN 62294 to be taken out of the main file, but I don't know I have to take that one out until I compare the two files and see that it is in both.
How difficult is this to do in Python?
Try installing pandas with pip install pandas
Then run this:
import pandas as pd
filename1 = 'main_file.txt'   # path to the main file (placeholder name)
filename2 = 'other_file.txt'  # path to the other file (placeholder name)
main = pd.read_csv(filename1, sep='\t')  # sep='\t' is for tab-delimited files
side = pd.read_csv(filename2, sep='\t')
main['URN'] = main['URN'].astype(int)
side['URN'] = side['URN'].astype(int)
merge = pd.merge(main, side, on='URN', how='inner')  # how='inner' keeps only rows whose URN appears in both files
#merge = merge[merge['URN'] != 62294]
print(merge)
merge.to_excel('Output.xlsx', index=False)
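If the goal is to actually remove the overlapping URNs from the main file rather than just list them, here is a minimal follow-up sketch (assuming the column is named URN, as above):
# keep only the main-file rows whose URN does NOT appear in the other file
cleaned = main[~main['URN'].isin(side['URN'])]
cleaned.to_excel('Cleaned.xlsx', index=False)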
Is it difficult? No, you could do it rather easily with:
file1 = open("file1.txt", "r")
results = []
for line in file1:
    file2 = open("file2.txt", "r")
    for l in file2:
        if l.split("\t")[0] == line.split("\t")[0]:
            results.append(l.split("\t")[0])
            break
    file2.close()
file1.close()
for i in results:
    print(i)
Now, is it the best way? Probably not for large files.
(Took me 74 seconds with your files).
You can look at the pandas library. It will allow you to load both tables as dataframes and join them on the needed column in SQL-like style. It should be rather easy with the documentation.
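If you want to stay with plain Python for now, a quicker variant (just a sketch, assuming the URN is the first tab-separated column in both files) reads the smaller file once into a set instead of re-opening it for every line:
# build a set of URNs from the smaller file for O(1) lookups
with open("file2.txt") as file2:
    urns = {line.split("\t")[0] for line in file2}

# single pass over the main file, printing URNs present in both
with open("file1.txt") as file1:
    for line in file1:
        urn = line.split("\t")[0]
        if urn in urns:
            print(urn)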
Still quite new to this and am struggling.
I have a directory of a few hundred text files; each file has thousands of lines of information in it.
Some lines contain one number, some contain many.
example:
39 312.000000 168.871795
100.835446
101.800298
102.414406
104.491999
108.855079
107.384008
103.608815
I need to pull all of the information from each text file. I want the name of the text file (minus the '.txt') to be in the first column, and all other information following that to complete the row (regardless of its layout within the file).
import pandas as pd
import os

data = '/path/to/data/'
path = '/other/directory/path/'
lst = ['list of files needed']

for dirpath, dirs, subj in os.walk(data):
    while i <= 5:  # currently being used to break before iterating through the entire directory, to check it's working
        with open(dirpath + lst[i], 'r') as file:
            info = file.read().replace('\n', '')  # txt file onto one line
        corpus.append(lst[i] + ' ')  # begin list with txt file name
        corpus.append(info)  # add file contents to list after file name
        output = ''.join(corpus)  # get out of list format
        output.split()
        i += 1
        df = pd.read_table(output, lineterminator=',')
        df.to_csv(path + 'testing.csv')
        if i > 5:
            break
Currently, this prints Errno 2 (no such file or directory), then goes on to print the contents of the first file and no others, but does not save anything to CSV.
This also seems horribly convoluted, and I'm sure there's another way of doing it.
I also suspect the lineterminator will not force each new text file onto a new row, so any suggestions there would be appreciated.
desired output:
file1 39 312.000 168.871
file2 72 317.212 173.526
You are loading os and pandas so you can take advantage of their functionality (listdir, path, DataFrame, concat, and to_csv) and drastically reduce your code's complexity.
import os
import pandas as pd

data = 'data/'
path = 'output/'

files = os.listdir(data)
output = pd.DataFrame()

for file in files:
    file_name = os.path.splitext(file)[0]
    with open(os.path.join(data, file)) as f:
        info = [float(x) for x in f.read().split()]
    #print(info)
    df = pd.DataFrame(info, columns=[file_name], index=range(len(info)))
    output = pd.concat([output, df], axis=1)

output = output.T
print(output)
output.to_csv(path + 'testing.csv')  # keep the index so the file names end up in the first column
I would double-check that your data folder only has txt files, and maybe add a check for txt files to the code (see the sketch below).
This got less elegant as I learned about the requirements. If you want to flip the columns and rows, just take out the output = output.T line; that line transposes the dataframe.
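For the txt-only check mentioned above, a minimal sketch would be to filter the directory listing before the loop (a simple suffix check; adjust it if your files use another extension):
# only keep .txt files from the data directory
files = [f for f in os.listdir(data) if f.endswith('.txt')]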
If I have for example 3 txt files that looks as follows:
file1.txt:
a 10
b 20
c 30
file2.txt:
d 40
e 50
f 60
file3.txt:
g 70
h 80
i 90
I would like to read this data from the files and create a single excel file that will look like this:
Specifically in my case I have 100+ txt files that I read using glob and loop.
Thank you
There's a bit of logic involved in getting the output you need.
First, process the input files into separate lists. You might need to adjust this logic depending on the actual contents of the files; you need to be able to get the columns for each file. For the samples provided, my logic works.
I added a safety check to see if the input files have the same number of rows. If they don't, it will seriously mess up the resulting Excel file. You'll need to add some logic if a length mismatch happens.
For the writing to the excel file, it's very easy using pandas in combination with openpyxl. There are likely more elegant solutions, but I'll leave it to you.
I'm referencing some SO answers in the code for further reading.
requirements.txt
pandas
openpyxl
main.py
# we use pandas for easy saving as XLSX
import pandas as pd

filelist = ["file01.txt", "file02.txt", "file03.txt"]


def load_file(filename: str) -> list:
    result = []
    with open(filename) as infile:
        # the split below is OS agnostic and removes EOL characters
        for line in infile.read().splitlines():
            # the split below splits on the space character by default
            result.append(line.split())
    return result


loaded_files = []
for filename in filelist:
    loaded_files.append(load_file(filename))

# you will want to check if the files have the same number of rows
# it will break stuff if they don't, you could fix it by appending empty rows
# stolen from:
# https://stackoverflow.com/a/10825126/9267296
len_first = len(loaded_files[0]) if loaded_files else None
if not all(len(i) == len_first for i in loaded_files):
    print("length mismatch")
    exit(419)

# generate empty list of lists so we don't get an index error below
# stolen from:
# https://stackoverflow.com/a/33990699/9267296
result = [[] for _ in range(len(loaded_files[0]))]

for f in loaded_files:
    for index, row in enumerate(f):
        result[index].extend(row)
        result[index].append('')

# trim the last empty column
result = [line[:-1] for line in result]

# write as excel file
# stolen from:
# https://stackoverflow.com/a/55511313/9267296
# note that there are some other options on this SO question, but this one
# is easily readable
df = pd.DataFrame(result)
with pd.ExcelWriter("output.xlsx") as writer:
    df.to_excel(writer, sheet_name="sheet_name_goes_here", index=False)
result:
I have 176 .tsv files resulting from a gene alignment, each looking like this:
target_id    length    tpm
ENST0001     12        100
ENST0001     9         5
In these files, I expect a certain overlap between the target_id columns, but not a complete one, so I would like to do a full join and keep all rows. Additionally, I am interested in keeping only the tpm values of each file and renaming the column according to the file name.
The expected dataframe would be something similar to:
target_id    SRR100001    SRR100002
ENST0001     100          7
ENST00015    5            0
I am aware of the join command in bash, but it can only be used on two files at a time, and if I understood correctly I cannot select specific columns...
Thank you in advance!
EDIT: The files are named as SRR*.tsv
Let me know if this code works for you; it's hard to test without having the files.
import re
import os
import sys
import pandas as pd

tpm_dict = {}
for fn in os.listdir(sys.argv[1]):
    if re.match(r'.*\.tsv$', fn):
        header = fn.replace('.tsv', '')
        this_df = pd.read_csv(os.path.join(sys.argv[1], fn), sep='\t')
        for i, row in this_df.iterrows():
            try:
                tpm_dict[row['target_id']][header] = row['tpm']
            except KeyError:
                try:
                    tpm_dict[row['target_id']] = {header: row['tpm']}
                except:
                    print(f"Problem in {fn} at row {i}")

df = pd.DataFrame.from_dict(tpm_dict, orient='index')
df.to_csv('joined.tsv', sep='\t')
Save as tsvjoin.py and then run python3 tsvjoin.py <folder with TSVs>
Edit: typos
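If you would rather stay entirely in pandas instead of filling the dictionary row by row, a rough alternative sketch (assuming each SRR*.tsv has target_id and tpm columns and that target_id is unique within a file) chains outer merges:
import glob
import os
from functools import reduce
import pandas as pd

frames = []
for fn in glob.glob('SRR*.tsv'):
    name = os.path.splitext(os.path.basename(fn))[0]  # e.g. SRR100001
    df = pd.read_csv(fn, sep='\t', usecols=['target_id', 'tpm'])
    frames.append(df.rename(columns={'tpm': name}))

# full (outer) join on target_id keeps rows that appear in any file
joined = reduce(lambda a, b: pd.merge(a, b, on='target_id', how='outer'), frames)
joined.to_csv('joined.tsv', sep='\t', index=False)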
I'm pretty new to Python and coding in general. I've searched several CSV comparison questions and answers and couldn't find anything that helped with this specific comparison problem.
I have two files that contain network asset info. Some devices have multiple IP addresses in one file and only one address in the other. Also, the two files don't use a consistent uppercase or lowercase format. I'm interested in their hostname values.
(files don't have headers)
File 1:
HOSTNAME1,10.0.0.1
HOSTNAME2,10.0.0.2
HOSTNAME3,10.19.0.3
hostname4,10.19.0.4,10.19.17.31,10.19.17.32,10.19.17.33,10.19.17.34
hostname5,10.19.0.40,10.19.17.51,10.19.17.52,10.19.17.53,10.19.17.54
hostname6,10.19.0.55,10.19.17.56,10.19.17.57,10.19.17.58,10.19.17.59
File 2:
HOSTNAME4,10.19.0.4
HOSTNAME5,10.19.0.40
HOSTNAME6,10.19.0.55
hostname7,192.168.0.1
hostname8,192.168.0.2
hostname9,192.168.0.3
I'd like to compare these files based on hostname (column 0) and output to a third file that contains the rows in file1 that are NOT in file2, ignoring case, and ignoring whether they have multiple IPs in file1 or file2.
desired output:
file3:
HOSTNAME1,10.0.0.1
HOSTNAME2,10.0.0.2
HOSTNAME3,10.19.0.3
I tried a simple comm command in bash to see if I could generate the desired result and had no luck, so I decided to try this in Python:
comm -23 --nocheck-order file1.csv file2.csv > file3.csv
Here's what I've tried in Python:
with open('file1.csv', 'r') as f1, open('file2.csv', 'r') as f2:
    fileone = f1.readlines()
    filetwo = f2.readlines()

with open('file3.csv', 'w') as outFile:
    for line in fileone:
        if line not in filetwo:
            outFile.write(line)
The problem is that rows are only treated as matches when the entire line matches exactly. Even if two rows share a hostname in column 0, a row with multiple IPs in one file isn't counted as a match.
I'm also not sure my code above ignores case, and it seems to be matching the entire string of a row rather than checking whether it contains the hostname.
I'm willing to try the pandas package if that makes more sense for this kind of comparison.
Your own code is not too far away from what you need to do.
Step 1: Create a set of the hostnames in file2.csv. Here the hostnames are converted to uppercase.
with open('file2.csv') as check_file:
    check_set = set([row.split(',')[0].strip().upper() for row in check_file])
Step 2: Iterate through the lines of file1.csv and check whether the hostname is in the set.
with open('file1.csv', 'r') as in_file, open('file3.csv', 'w') as out_file:
    for line in in_file:
        if line.split(',')[0].strip().upper() not in check_set:
            out_file.write(line)
Generated file file3.csv contents:
HOSTNAME1,10.0.0.1
HOSTNAME2,10.0.0.2
HOSTNAME3,10.19.0.3
Since you are interested in using pandas, I would suggest this.
Use read_csv to read the CSV files and merge to join them and identify the mismatches. For this, the number of columns in both files should be the same (or use names to define the columns). Having said that, if you are fine with comparing only the first column, you can try this.
import pandas as pd
#Read the 2 csv files and take only the first column
file1_df = pd.read_csv('filename1.csv',usecols=[0],names=['Name'])
file2_df = pd.read_csv('filename2.csv',usecols=[0],names=['Name'])
#Converting both the files first column to uppercase to make it case insensitive
file1_df['Name'] = file1_df['Name'].str.upper()
file2_df['Name'] = file2_df['Name'].str.upper()
#Merging both the Dataframe using left join
comparison_result = pd.merge(file1_df,file2_df,on='Name',how='left',indicator=True)
#Filtering only the rows that are available in left(file1)
comparison_result = comparison_result.loc[comparison_result['_merge'] == 'left_only']
print(comparison_result)
As I said, since the number of columns differs between the two CSVs (when split on commas), I'm reading only the first column. Hence the output will also contain only one column, as shown below.
HOSTNAME1
HOSTNAME2
HOSTNAME3
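If you also need the complete original rows from file1 in the output rather than just the hostnames, one rough follow-up sketch (reusing the comparison_result frame from the code above) filters the raw lines of file1:
# write the complete file1 rows whose hostname appears only in file1
missing = set(comparison_result['Name'])  # uppercase hostnames found only in file1

with open('filename1.csv') as f1, open('file3.csv', 'w') as out:
    for line in f1:
        if line.split(',')[0].strip().upper() in missing:
            out.write(line)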
You need to compare the first column only; try something like below:
filetwo = [val.split(',')[0].strip().lower() for val in filetwo]
for line in fileone:
    if line.split(',')[0].strip().lower() not in filetwo:
        print(line)
I am processing a CSV file, and before that I am getting the row count using the code below.
total_rows=sum(1 for row in open(csv_file,"r",encoding="utf-8"))
The code has been written with the help given in this link.
However, total_rows doesn't match the actual number of rows in the CSV file. I have found an alternative way to do it, but I would like to know why this is not working correctly.
In the CSV file, there are cells with huge text, and I have to specify the encoding to avoid errors reading the file.
Any help is appreciated!
Let's assume you have a CSV file in which some cell contains multi-line text.
$ cat example.csv
colA,colB
1,"Hi. This is Line 1.
And this is Line2"
Which, by the look of it, has three lines, and wc -l agrees:
$ wc -l example.csv
3 example.csv
And so does open with sum:
sum(1 for row in open('./example.csv',"r",encoding="utf-8"))
# 3
But now if you read it with a CSV parser such as pandas.read_csv:
import pandas as pd
df = pd.read_csv('./example.csv')
df
colA colB
0 1 Hi. This is Line 1.\nAnd this is Line2
The alternative way to fetch the correct number of rows is given below:
import csv

with open(csv_file, "r", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter=",")
    data = list(reader)
    row_count = len(data)
Excluding the header, the csv contains 1 line, which I believe is what you expect.
This is because colB's first cell (a.k.a. huge text block) is now properly handled with the quotes wrapping the entire text.
I think the problem here is that you are not counting rows but counting newlines (either \r\n on Windows or \n on Linux). The problem arises when you have a cell containing text with newline characters, for example:
1, "my huge text\n with many lines\n"
2, "other text"
Your method will return 4 for the data above when actually there are only 2 rows.
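To see the difference, here is a small self-contained sketch (it writes a hypothetical sample.csv similar to the data above, with the embedded newline quoted properly):
import csv

# hypothetical sample: one cell contains an embedded newline
with open('sample.csv', 'w', newline='', encoding='utf-8') as f:
    f.write('1,"my huge text\nwith many lines"\n2,"other text"\n')

# counting newlines sees 3 physical lines
print(sum(1 for _ in open('sample.csv', encoding='utf-8')))  # 3

# a CSV parser sees 2 logical rows
with open('sample.csv', newline='', encoding='utf-8') as f:
    print(sum(1 for _ in csv.reader(f)))  # 2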
Try using pandas or another library to read CSV files. Example:
import pandas as pd

df = pd.read_csv(pathToCsv, sep=',', header=None)
number_of_rows = len(df.index)  # or df[0].count()
Note that len(df.index) and df[0].count() are not interchangeable as count excludes NaNs.
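A tiny illustration of that difference, using a hypothetical frame with one missing value:
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [1.0, np.nan, 3.0]})
print(len(df.index))   # 3 -> counts every row
print(df[0].count())   # 2 -> NaN rows are excluded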