I would like to prefix each of the 128 columns in a text file with its column number.
E.g.
My file
12 13 14 15
20 21 23 14
34 56 67 89
Required output
1:12 2:13 3:14 4:15
1:20 2:21 3:23 4:14
1:34 2:56 3:67 4:89
Can this be done using awk / Python?
I tried the paste command to join two files: one with the values, the other with manually typed column numbers. Since the file is very large, manual typing didn't work.
To my knowledge, the answers I could find only cover adding a single column to the end of a text file.
Thanks for the suggestions.
awk to the rescue!
$ awk '{for(i=1;i<=NF;i++) $i=i":"$i}1' file
should do: the loop rewrites each field in place as column_number:value, and the trailing 1 is an always-true condition that makes awk print the modified line.
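If you would rather do it in Python, here is a minimal sketch along the same lines (the file names are placeholders; it assumes whitespace-separated columns):
with open('file') as src, open('out.txt', 'w') as dst:
    for line in src:
        fields = line.split()
        # prefix each field with its 1-based column number
        dst.write(' '.join('{0}:{1}'.format(i, f)
                           for i, f in enumerate(fields, 1)) + '\n')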
Hi, I'm trying to parse a .TX0 file from a chromatogram. The file is just a bunch of results, including retention times etc. I eventually want to pick certain pieces of data from multiple files and do some analysis. So far I have:
filename = 'filepath'
f = open(filename, 'r')
lines = f.readlines()
print lines
my output is:
Out[29]:
[....................................................
'"condensate analysis (HP4890 Optic - FID)"\n',
'"Peak","Component","Time","Area","Height","BL"\n',
'"#","Name","[min]","[uV*sec]","[uV]",""\n',
'------,------,------,------,------,------\n',
'1,"Methane",0.689,5187666.22,994337.57,*BB\n',
'2,"Ethane",1.061,1453339.93,729285.09,*BB\n',
'3,"Propane",1.715,193334.09,63398.74,*BB\n',
'4,"i-Butane",2.792,157630.92,29233.56,*BV\n',
'5,"n-Butane",3.240,98943.96,15822.72,*VB\n',
'"","","",------,------,""\n',
'"","","",7090915.11,1.83e+06,""\n',
'"Missing Component Report"\n',
'"Component","Expected Retention (Calibration File)"\n',
'------,------\n',
'"All components were found"\n',
'"Report stored in ASCII file :","...
"\n'.......................]
Now, the problem I'm having: I can't get this output into a structured DataFrame using pandas. I've tried, and it just gives me two columns...
pd.DataFrame(lines)
Out[26]:
0
0 "=============================================...
1 "Software Version:",6.3.2.0646,"Date:","08/06/...
2 "Reprocess Number:","vma2: ......................
.......................
10 ""\n
11 ""\n
12 "condensate analysis (HP4890 Optic - FID)"\n
13 "Peak","Component","Time","Area","Height","BL"\n
14 "#","Name","[min]","[uV*sec]","[uV]",""\n
15 ------,------,------,------,------,------\n
16 1,"Methane",0.689,5187666.22,994337.57,*BB\n
17 2,"Ethane",1.061,1453339.93,729285.09,*BB\n
18 3,"Propane",1.715,193334.09,63398.74,*BB\n
19 4,"i-Butane",2.792,157630.92,29233.56,*BV\n
20 5,"n-Butane",3.240,98943.96,15822.72,*VB\n
21 "","","",------,------,""\n
22 "","","",7090915.11,1.83e+06,""\n
23 "Missing Component Report"\n
24 "Component","Expected Retention (Calibration F...
25 ------,------\n
26 "All components were found"\n
27 "Report stored in ASCII file :","C:\Shared Fol...
I have separate sets of files. One set contains only header info, like the examples shown below:
~content of "header1.txt"~
a 3
b 2
c 4
~content of "header2.txt"~
a 4
b 3
c 5
~content of "header3.txt"~
a 1
b 7
c 6
And another set contains only data, as shown below:
~content of "data1.txt"~
10 20 30 40
20 14 22 33
~content of "data2.txt"~
11 21 31 41
21 24 12 23
~content of "data3.txt"~
21 22 11 31
10 26 14 33
After combining the corresponding files, the results should look like the examples below:
~content of "asc1.txt"~
a 3
b 2
c 4
10 20 30 40
20 14 22 33
~content of "asc2.txt"~
a 4
b 3
c 5
11 21 31 41
21 24 12 23
~content of "asc3.txt"~
a 1
b 7
c 6
21 22 11 31
10 26 14 33
Can anyone give me some help writing this in Python? Thanks!
If you really want it in Python, here is one way to do it:
for i in range(1, 4):
    h = open('header{0}.txt'.format(i), 'r')
    d = open('data{0}.txt'.format(i), 'r')
    a = open('asc{0}.txt'.format(i), 'w')  # 'w' so reruns overwrite rather than append
    hdata = h.readlines()
    ddata = d.readlines()
    a.writelines(hdata + ddata)
    h.close()
    d.close()
    a.close()
Of course, this assumes there are three of each file and that all of them follow the naming convention you used.
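The same loop is a little safer written with the with statement, which closes each file automatically even if something goes wrong:
for i in range(1, 4):
    with open('header{0}.txt'.format(i)) as h, \
         open('data{0}.txt'.format(i)) as d, \
         open('asc{0}.txt'.format(i), 'w') as a:
        a.writelines(h.readlines() + d.readlines())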
Try this (written in Python 3.4 IDLE). It's pretty long, but should be easier to understand:
# can start by creating a function to read contents of
# each file and return the contents as a string
def readFile(file):
    contentsStr = ''
    for line in file:
        contentsStr += line
    return contentsStr
# Read all the header files header1, header2, header3
header1 = open('header1.txt','r')
header2 = open('header2.txt','r')
header3 = open('header3.txt','r')
# Read all the data files data1, data2, data3
data1 = open('data1.txt','r')
data2 = open('data2.txt','r')
data3 = open('data3.txt','r')
# Open/create output files asc1, asc2, asc3
asc1_outFile = open('asc1.txt','w')
asc2_outFile = open('asc2.txt','w')
asc3_outFile = open('asc3.txt','w')
# read contents of each header file and data file into string variables
header1_contents = readFile(header1)
header2_contents = readFile(header2)
header3_contents = readFile(header3)
data1_contents = readFile(data1)
data2_contents = readFile(data2)
data3_contents = readFile(data3)
# Append the contents of each data file to its corresponding
# header file's contents (the header content already ends with
# a newline, so no extra separator is needed)
asc1_contents = header1_contents + data1_contents
asc2_contents = header2_contents + data2_contents
asc3_contents = header3_contents + data3_contents
# now write the necessary results to asc1.txt, asc2.txt, and
# asc3.txt output files respectively
asc1_outFile.write(asc1_contents)
asc2_outFile.write(asc2_contents)
asc3_outFile.write(asc3_contents)
# close the file streams
header1.close()
header2.close()
header3.close()
data1.close()
data2.close()
data3.close()
asc1_outFile.close()
asc2_outFile.close()
asc3_outFile.close()
# done!
By the way, ensure that the header files and data files are in the same directory as the Python script; otherwise, simply edit the file paths in the code above. The output files asc1.txt, asc2.txt, and asc3.txt will be created in the same directory as your Python source file.
This works as long as the number of header files equals the number of data files:
# glob is imported to get file names matching the given pattern
import glob

header = []
data = []

# Traverse the files and collect their contents; sorted() keeps
# header1 paired with data1, header2 with data2, and so on
for files1 in sorted(glob.glob("directory/header*.txt")):
    a = open(files1, "r").read()
    header.append(a)
for files2 in sorted(glob.glob("directory/data*.txt")):
    a1 = open(files2, "r").read()
    data.append(a1)

# Write the combined content into the output files
# (the header content already ends with a newline)
for i in range(1, len(data) + 1):
    writer = open("directory/asc" + str(i) + ".txt", "w")
    writer.write(header[i - 1] + data[i - 1])
    writer.close()
Edit
This method will only work if the header and data files live in separate folders, and those folders contain no files other than the header files and data files:
# glob is imported to get file names matching the given pattern
import glob

header = []
data = []

# Traverse the files and collect their contents
for files1 in sorted(glob.glob("directory1/*.txt")):
    a = open(files1, "r").read()
    header.append(a)
for files2 in sorted(glob.glob("directory2/*.txt")):
    a1 = open(files2, "r").read()
    data.append(a1)

# Write the combined content into the output files
for i in range(1, len(data) + 1):
    writer = open("directory/asc" + str(i) + ".txt", "w")
    writer.write(header[i - 1] + data[i - 1])
    writer.close()
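For what it's worth, the collection loops can also be collapsed by pairing the sorted file lists with zip, which avoids the index arithmetic entirely (a sketch, using the same hypothetical directory layout):
import glob

headers = sorted(glob.glob("directory1/*.txt"))
datas = sorted(glob.glob("directory2/*.txt"))

for i, (hpath, dpath) in enumerate(zip(headers, datas), 1):
    with open(hpath) as h, open(dpath) as d, \
         open("directory/asc{0}.txt".format(i), "w") as out:
        out.write(h.read() + d.read())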
I have a 200 MB CSV file and a 4 GB JSON file in compressed format (300 MB compressed). I need to check whether a particular field in the JSON has a value that matches any of the values in column 0 of the CSV file. How can this be done quickly, given that I have to do it for multiple JSON files while the CSV file stays the same? I hope using pandas would speed things up.
After reading the CSV file, the following data structure is formed:
Empty DataFrame
Columns: []
Index: [1335063, 1339033, 1344453, 1392603, 1520033, 5342858, 5361498, 5534501, 5542881, 5552665, 5618397, 5824472, 5867442, 5908134, 5908134, 6203501, 6208411, 6209921, 6211681, 6212831, 6213691, 6287061, 6293811, 6387151, 6415771, 6508691, 6649281, 6673261, 6716441, 6782181, 6821631, 7710551, 9413871, 11280941, 11285381, 11762751, 11769381, 11854271, 11964831, 11995871, 12240091, 12541201, 12553471, 12633891, 12648021, 12834201, 12899581, 13177041, 13282401, 13290581, 13292951, 13297681, 14536901, 14592891, 14665721, 14843571, 15120821, 15127231, 15531511, 15969981, 16648561, 16808911, 16809381, 17019781, 17021721, 17224241, 17234921, 17327321, 17923721, 17930901, 18577181, 18606681, 19448911, 19557541, 20272801, 20286621, 20295001, 20351761, 21052471, 21062651, 21106501, 21578741, 22279401, 22312931, 23078211, 23164911, 24937351, 24988721, 26171811, 26188561, 26224001, 26379241, 26380531, 26383571, 26386251, 26388621, 27509171, 27825771, 28282901, 28998561, ...]
Now the data read from the gzip file will be a JSON string, which I can convert with read_json. But I don't see how to check whether the field 'id' in the JSON is present in the list shown here.
This should get you started:
import numpy as np
import pandas
magic_value = 11
df = pandas.DataFrame(np.random.randint(0, 13, size=(10, 2)))  # random integers 0..12
# 0 1
# 0 1 1
# 1 5 3
# 2 12 12
# 3 12 8
# 4 11 4
# 5 11 12
# 6 9 7
# 7 7 1
# 8 0 11
# 9 2 1
magic_value in df[0].values
# True
So just read in the JSON data with pandas.read_json, get the value you want (see the pandas indexing docs), and go to town.
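Applied to the problem in the question, a minimal sketch (the file names are hypothetical; it assumes column 0 of the CSV holds the IDs and each JSON file is gzipped):
import gzip
import pandas as pd

# read column 0 of the CSV once, into a set for O(1) membership tests
ids = set(pd.read_csv('ids.csv', usecols=[0], header=None)[0])

# decompress and parse one of the JSON files
with gzip.open('records.json.gz', 'rt') as f:
    df = pd.read_json(f)

# keep only the rows whose 'id' field matches a CSV value
matches = df[df['id'].isin(ids)]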
I have a number set which contains 2375013 unique numbers in a txt file. The data looks like this:
11009
900221
2
3
4930568
293
102
I want to match a number from each line of another data set against this number set, to extract the data I need. So I coded it like this:
def get_US_users_IDs(filepath, mode):
    IDs = []
    with open(filepath, mode) as f:
        for line in f:
            sp = line.strip()
            for id in sp:
                IDs.append(id.lower())
    return IDs

IDs = "|".join(get_US_users_IDs('/nas/USAuserlist.txt', 'r'))
matcher = re.compile(IDs)
if matcher.match(user_id):
    number_of_US_user += 1
    text = tweet.split('\t')[3]
But it takes a long time to run. Any ideas for reducing the run time?
What I understand is that you have a huge number of IDs in a file and you want to know whether a specific user_id is in that file.
You can use a Python set: membership tests on a set take constant time on average, instead of matching a regular expression with millions of alternatives against every line.
fd = open(filepath, mode)
# build the set once; each membership test is then O(1) on average
IDs = set(int(id) for id in fd)
...
if int(user_id) in IDs:
    number_of_US_user += 1
...
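Putting that together with the function from the question, a sketch that loads the ID file once into a set of strings and tests each user_id directly (keeping the IDs as strings avoids the int conversion, assuming user_id arrives as text; note it also collects one ID per line, rather than looping over individual characters):
def get_US_users_IDs(filepath):
    # one ID per line -> a set of strings for O(1) membership tests
    with open(filepath) as f:
        return set(line.strip() for line in f)

IDs = get_US_users_IDs('/nas/USAuserlist.txt')
if user_id.strip() in IDs:
    number_of_US_user += 1
    text = tweet.split('\t')[3]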