How do I efficiently match a specific number against a set of numbers? - python

I have a set of 2375013 unique numbers stored in a txt file, one number per line. The data looks like this:
11009
900221
2
3
4930568
293
102
I want to match a number from each line of another data set against this number set, so that I can extract the data I need. So I coded it like this:
def get_US_users_IDs(filepath, mode):
    IDs = []
    with open(filepath, mode) as f:
        for line in f:
            sp = line.strip()
            for id in sp:
                IDs.append(id.lower())
    return IDs
...
IDs = "|".join(get_US_users_IDs('/nas/USAuserlist.txt', 'r'))
matcher = re.compile(IDs)
if matcher.match(user_id):
    number_of_US_user += 1
    text = tweet.split('\t')[3]
But it takes a long time to run. Is there any way to reduce the run time?

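As I understand it, you have a huge number of IDs in a file and you want to know whether a specific user_id is in that file.
You can use a Python set, which tests membership in roughly constant time on average instead of scanning every ID.
with open(filepath, mode) as fd:
    # one ID per line; int() tolerates the trailing newline
    IDs = set(int(line) for line in fd)
...
# user_id must be an int here; convert it with int(user_id) if it was read as text
if user_id in IDs:
    number_of_US_user += 1
...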
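For illustration, here is a minimal self-contained sketch of the set approach. The helper name load_ids, the in-memory sample file, and the candidate IDs are assumptions made for the example, not part of the original code. A regular expression built from millions of alternatives typically has to try them one after another, whereas a set lookup hashes the value once.

```python
import io

def load_ids(lines):
    """Build a set of integer IDs from an iterable of text lines, skipping blanks."""
    return {int(line) for line in lines if line.strip()}

# Stand-in for the ID file from the question (sample values only).
id_file = io.StringIO("11009\n900221\n2\n3\n4930568\n293\n102\n")
ids = load_ids(id_file)

# Hypothetical user IDs as they might be parsed from the other data set (as text).
candidates = ["293", "77", "4930568"]

# Convert each candidate to int so it compares equal to the stored ints.
matches = [c for c in candidates if int(c) in ids]
print(matches)
```

If the IDs should stay as strings, the same idea works with a set of stripped strings, as long as both sides use the same type.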

Related

How to get the index of sorted timestamps?

I have a text file that contains the following:
n 1 id 10 12:17:32 type 6 is transitioning
n 2 id 10 12:16:12 type 5 is active
n 2 id 10 12:18:45 type 6 is transitioning
n 3 id 10 12:16:06 type 6 is transitioning
n 3 id 10 12:17:02 type 6 is transitioning
...
I need to sort these lines in Python by the timestamp. I can read the file line by line, collect all timestamps, and sort them using sorted(timestamps), but then I need to arrange the lines according to the sorted timestamps.
How to get the index of sorted timestamps?
Is there some more elegant solution (I'm sure there is)?
import time
nID = []
mID = []
ts = []
ntype = []
comm = []
with open('changes.txt') as fp:
    while True:
        line = fp.readline()
        if not line:
            break
        lx = line.split(' ')
        nID.append(lx[1])
        mID.append(lx[3])
        ts.append(lx[4])
        ntype.append(lx[6])
        comm.append(lx[7:])
So now I can use sorted(ts) to sort the timestamps, but I don't get the indices of the sorted timestamp values.
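One possible approach, sketched here under the assumption that every timestamp is a zero-padded HH:MM:SS string from the same day (so plain string comparison matches chronological order): either sort the index positions by their timestamp, or sort the lines directly with a key function. The helper name timestamp is made up for this example.

```python
lines = [
    "n 1 id 10 12:17:32 type 6 is transitioning",
    "n 2 id 10 12:16:12 type 5 is active",
    "n 2 id 10 12:18:45 type 6 is transitioning",
    "n 3 id 10 12:16:06 type 6 is transitioning",
    "n 3 id 10 12:17:02 type 6 is transitioning",
]

def timestamp(line):
    """Return the HH:MM:SS field, the fifth whitespace-separated token."""
    return line.split()[4]

ts = [timestamp(line) for line in lines]

# Indices that put ts into ascending order (an "argsort").
order = sorted(range(len(ts)), key=ts.__getitem__)
print(order)  # [3, 1, 4, 0, 2]

# Alternatively, sort the lines themselves without tracking indices.
sorted_lines = sorted(lines, key=timestamp)
for line in sorted_lines:
    print(line)
```

The index list is useful when several parallel lists (nID, mID, ...) have to be reordered the same way; if only the lines matter, sorting them with a key is shorter.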

Text file combining script - Python - Big Data

I was wondering if anyone could help me come up with a better way of doing this.
Basically, I have text files that are formatted like this (some have more columns, some have fewer; each column is separated by spaces):
AA BB CC DD Col1 Col2 Col3
XX XX XX Total 1234 1234 1234
Aaaa OO0 LAHB TEXT 111 41 99
Aaaa OO0 BLAH XETT 112 35 176
Aaaa OO0 BALH TXET 131 52 133
Aaaa OO0 HALB EXTT 144 32 193
These text files range in size from a few hundred KB to around 100MB for the newest and largest files. What I need to do is combine two or more files. First I check for duplicate data, that is, whether AA BB CC and DD from each row match any rows from the other files. If so, I append the data from Col1 Col2 Col3 (etc.) to that row; if not, I fill the new columns with zeros. Then I calculate the top 100 rows based on the total of each row and output the top 100 results to a webpage.
Here is the Python code I'm using:
import operator
def combine(dataFolder, getName, sort):
    files = getName.split(",")
    longestHeader = 0
    longestHeaderFile =[]
    dataHeaders = []
    dataHeaderCode = []
    fileNumber = 1
    combinedFile = {}
    for fileName in files:
        lines = []
        file = dataFolder+"/tableFile/"+fileName+".txt"
        with open(file) as f:
            x = 0
            for line in f:
                lines.append(line.upper().split())
                if x == 1:
                    break
        splitLine = lines[1].index("TOTAL")+1
        dataHeaders.extend(lines[0][splitLine:])
        headerNumber = 1
        for name in lines[0][splitLine:]:
            dataHeaderCode.append(str(fileNumber)+"_"+str(headerNumber))
            headerNumber += 1
        if splitLine > longestHeader:
            longestHeader = splitLine
            longestHeaderFile = lines[0][:splitLine]
        fileNumber += 1
    for fileName in files:
        lines = []
        file = dataFolder+"/tableFile/"+fileName+".txt"
        with open(file) as f:
            for line in f:
                lines.append(line.upper().split())
        splitLine = lines[1].index("TOTAL")+1
        headers = lines[0][:splitLine]
        data = lines[0][splitLine:]
        for x in range(2, len(lines)):
            normalizedLine = {}
            lineName = ""
            total = 0
            for header in longestHeaderFile:
                try:
                    if header == "EE" or header == "DD":
                        index = splitLine-1
                    else:
                        index = headers.index(header)
                    normalizedLine[header] = lines[x][index]
                except ValueError:
                    normalizedLine[header] = "XX"
                lineName += normalizedLine[header]
            combinedFile[lineName] = normalizedLine
            for header in dataHeaders:
                headIndex = dataHeaders.index(header)
                name = dataHeaderCode[headIndex]
                try:
                    index = splitLine+data.index(header)
                    value = int(lines[x][index])
                except ValueError:
                    value = 0
                except IndexError:
                    value = 0
                try:
                    value = combinedFile[lineName][header]
                    combinedFile[lineName][name] = int(value)
                except KeyError:
                    combinedFile[lineName][name] = int(value)
                total += int(value)
            combinedFile[lineName]["TOTAL"] = total
    combined = sorted(combinedFile.values(), key=operator.itemgetter(sort), reverse=True)
    return combined
I'm pretty new to Python, so this may not be the most "Pythonic" way of doing it. It works, but it's slow (about 12 seconds for two files of about 6MB each), and when we uploaded the code to our AWS server we got a 500 error from the server saying headers were too large (when we tried to combine larger files). Can anyone help me refine this into something a bit quicker and better suited to a web environment? Also, just to clarify, I don't have access to the AWS server or its settings; that goes through our Lead Developer, so I have no real idea how it's set up. I do most of my dev work on localhost and then commit to GitHub.
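As a rough sketch of one way to restructure the merge (not a drop-in replacement for the function above): key each row on a tuple of its leading columns, keep one zero-filled list of values per key, and pick the top rows with heapq.nlargest instead of sorting everything. The names parse_table and key_width, and the second sample file, are assumptions for this example, and the sketch assumes every file has the same four key columns, so it skips the header normalization the original code does.

```python
import heapq
from collections import defaultdict

def parse_table(text, key_width=4):
    """Split a whitespace-separated table into (key, values) rows.

    Assumes the first key_width columns form the key and that the first two
    lines are the header row and the 'Total' summary row, which are skipped.
    """
    rows = []
    for line in text.strip().splitlines()[2:]:
        fields = line.upper().split()
        rows.append((tuple(fields[:key_width]), [int(v) for v in fields[key_width:]]))
    return rows

def combine(tables, top=100):
    """Merge tables on their key columns, zero-filling values a file lacks."""
    widths = [max((len(values) for _, values in table), default=0) for table in tables]
    total_width = sum(widths)
    combined = defaultdict(lambda: [0] * total_width)
    offset = 0
    for table, width in zip(tables, widths):
        for key, values in table:
            # Same-length slice assignment overwrites this file's columns in place.
            combined[key][offset:offset + len(values)] = values
        offset += width
    # nlargest avoids fully sorting every row when only the top few are needed.
    return heapq.nlargest(top, combined.items(), key=lambda item: sum(item[1]))

file_a = """AA BB CC DD Col1 Col2 Col3
XX XX XX Total 1234 1234 1234
Aaaa OO0 LAHB TEXT 111 41 99
Aaaa OO0 BLAH XETT 112 35 176
"""
file_b = """AA BB CC DD Col1 Col2
XX XX XX Total 10 10
Aaaa OO0 BLAH XETT 1 2
Aaaa OO0 HALB EXTT 144 32
"""
result = combine([parse_table(file_a), parse_table(file_b)], top=2)
for key, values in result:
    print(" ".join(key), values, sum(values))
```

Each file is read once and each row costs one dictionary lookup, which avoids the repeated list.index calls inside the inner loops of the original.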

How can I extract elements from lists in Python?

I am trying to extract elements from a list.
I've looked through a lot of material, but I still don't know how to do it.
This is my test.txt (text file)
[ left column = time, right column = value ]
0 81
1 78
2 76
3 74
4 81
5 79
6 80
7 81
8 83
9 83
10 83
11 82
.
.
22 81
23 80
If the current time is equal to a time in the table, I want to extract the value for that time.
This is my demo.py (Python file)
import datetime
now = datetime.datetime.now()
current_hour = now.hour
with open('test.txt') as f:
    lines = f.readlines()
time = [int(line.split()[0]) for line in lines]
value = [int(line.split()[1]) for line in lines]
>>>time = [0,1,2,3,4,5,....,23]
>>>value = [81,78,76,......,80]
You could write a loop that iterates over the list, checking each position for the current hour.
Starting at position 0, it compares time[pos] with the current hour. If they are equal, it assigns value[pos] to the variable extractedValue and then breaks out of the loop.
If they are not equal, it increases the pos variable by 1, which we use to index into the lists. So it keeps searching until the if condition is True or the list ends.
pos = 0
for i in time:
    if current_hour == time[pos]:
        extractedValue = value[pos]
        break
    else:
        pos += 1
Feel free to ask if you don't understand something :)
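The same search can also be written without a manual position counter by walking both lists together with zip. This is only a sketch: the function name value_at is invented for the example, and the values for hours 12 to 21 are made up because the question elides them.

```python
import datetime

# Sample data in the shape the question builds from test.txt.
# Hours 12-21 are hypothetical filler; the question does not show them.
time = list(range(24))
value = [81, 78, 76, 74, 81, 79, 80, 81, 83, 83, 83, 82,
         80, 79, 78, 77, 76, 75, 77, 79, 80, 82, 81, 80]

def value_at(hour, times, values):
    """Return the value paired with `hour`, or None if the hour is not listed."""
    for t, v in zip(times, values):
        if t == hour:
            return v
    return None

current_hour = datetime.datetime.now().hour
extracted_value = value_at(current_hour, time, value)
print(current_hour, extracted_value)
```

Returning None for a missing hour avoids a NameError later if no row matches, which the loop with break would not guard against.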
Assuming unique values for the time column:
import datetime
with open('test.txt') as f:
    lines = f.readlines()
# this will create a dictionary with the time value from test.txt as the key
# (keys are converted to int so they compare equal to now().hour)
time_data_dict = {int(l.split()[0]): int(l.split()[1]) for l in lines}
current_hour = datetime.datetime.now().hour
print(time_data_dict[current_hour])
import datetime
import csv
data = {}
with open('hour.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        k, v = row
        data[k] = v
hour = str(datetime.datetime.now().hour)
print(data[hour])

How to split one column into two columns in python?

I have a contig file loaded in pandas like this:
>NODE_1_length_4014_cov_1.97676
1 AATTAATGAAATAAAGCAAGAAGACAAGGTTAGACAAAAAAAAGAG...
2 CAAAGCCTCCAAGAAATATGGGACTATGTGAAAAGACCAAATCTAC...
3 CCTGAAAGTGACGGGGAGAATGGAACCAAGTTGGAAAACACTCTGC...
4 GAGAACTTCCCCAATCTAGCAAGGCAGGCCAACATTCAAATTCAGG...
5 CCACAAAGATACTCCTCGAGAAGAGCAACTCCAAGACACATAATTG...
6 GTTGAAATGAAGGAAAAAATGTTAAGGGCAGCCAGAGAGAAAGGTC...
7 GGGAAGCCCATCAGACTAACAGCGGATCTCTCGGCAGAAACCCTAC...
8 TGGGGGCCAATATTCAACATTCTTAAAGAAAAGAATTTTCAACCCA...
9 GCCAAACTAAGCTTCATAAGCAAAGGAGAAATAAAATCCTTTACAG...
10 AGAGATTTTGTCACCACCAGGCCTGCCTTACAAGAGCTCCTGAAGG...
11 GAAAGGAAAAACCGGTACCAGCCACTGCAAAATCATGCCAAACTGT...
12 CTAGGAAGAAACTGCATCAACTAATGAGCAAAATAACCAGCTAACA...
13 TCAAATTCACACATAACAATATTAACCTTAAATGTAAATGGGCTAA...
14 AGACACAGACTGGCAAATTGGATAAAGAGTCAAGACCCATCAGTGT...
15 ACCCATCTCAAATGCAGAGACACACATAGGCTCAAAATAAAGGGAT...
16 CAAGCAAATGGAAAACAAAAAAAGGCAGGGGTTGCAATCCTAGTCT...
17 TTTAAACCAACAAAGATCAAAAGAGACAAAGAAGGCCATTACATAA...
18 ATTCAACAAGAAGAGCTAACTATCCTAAATATATATGCACCCAATA...
19 TTCATAAAGCAAGTCCTCAGTGACCTACAAAGAGACTTAGACTCCC...
20 GGAGACTTTAACACCCCACTGTCAACATTAGACAGATCAACGAGAC...
21 GATATCCAGGAATTGAACTCAGCTCTGCACCAAGCGGACCTAATAG...
22 CTCCACCCCAAATCAACAGAATATACATTCTTTTCAGCACCACACC...
23 ATTGACCACATAGTTGGAAGTAAAGCTCTCCTCAGCAAATGTAAAA...
24 ACAAACTGTCTCTCAGACCACAGTGCAATCAAATTAGAACTCAGGA...
25 CAAAACTGCTCAACTACATGAAAACTGAACAACCTGCTCCTGAATG...
26 AACAAAATGAAGGCAGAAATAAAGATGTTCTTTGAAACCAATGAGA...
27 TACCAGAATCTCTGGGACGCATTCAAAGCAGTGTGTAGAGGGAAAT...
28 GCCCACAAGAGAAAGCAGGAAAGATCTAAAATTGACACCCTAACAT...
29 CTAGAGAAGCAAGAGCAAACACATTCAAAAGCTAGCAGAAGGCAAG...
...
8540 >NODE_2518_length_56_cov_219
8541 CCCTTGTTGGTGTTACAAAGCCCTTGAACTACATCAGCAAAGACAA...
8542 >NODE_2519_length_56_cov_174
8543 CCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCTCGA...
8544 >NODE_2520_length_56_cov_131
8545 CCCAGGAGACTTGTCTTTGCTGATGTAGTTCAAGAGCTTTGTAACA...
8546 >NODE_2521_length_56_cov_118
8547 GGCTCCCTATCGGCTCGAATTCCGCTCGACTATTATCGAATTCCGC...
8548 >NODE_2522_length_56_cov_96
8549 CCCGCCCCCAGGAGACTTGTCTTTGCTGATAGTAGTCGAGCGGAAT...
8550 >NODE_2523_length_56_cov_74
8551 AGAGACTTGTCTTTGCTGATGTAGTTCAAGGGCTTTGTAACACCGA...
8552 >NODE_2524_length_56_cov_70
8553 TGCTCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCT...
8554 >NODE_2525_length_56_cov_59
8555 GAGACCCTTGTCGGTGTTACAAAGCCCTTTAACTACATCAGCAAAG...
8556 >NODE_2526_length_56_cov_48
8557 CCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCTCGA...
8558 >NODE_2527_length_56_cov_44
8559 CCAAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGGCATT...
8560 >NODE_2528_length_56_cov_42
8561 GAGACCCTTGTAGGTGTTACAAAGCCCTTGAACTACATCAGCAAAG...
8562 >NODE_2529_length_56_cov_38
8563 GAGACCCTTGTCGGTGTCACAAAGCCCTTGAACTACATCAGCAAAG...
8564 >NODE_2530_length_56_cov_29
8565 GAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGGCATTCT...
8566 >NODE_2531_length_56_cov_26
8567 AGGTTCAAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGG...
8568 >NODE_2532_length_56_cov_25
8569 GAGATGTGTATAAGAGACTTGTCTTTGCTGATGTAGTTCAAGGGCT...
How can I split this one column into two columns, with >NODE_...... in one column and the corresponding sequence in the other? Another issue is that the sequences span multiple lines; how can I join them into one string? The expected result looks like this:
contig sequence
NODE_1_length_4014_cov_1.97676 AAAAAAAAAAAAAAA
NODE_........ TTTTTTTTTTTTTTT
Thank you very much.
I can't reproduce your example, but my guess is that you are loading a file with pandas that is not in a tabular format. From your example it looks like your file is formatted like this:
>Identifier
sequence
>Identifier
sequence
You would have to parse the file before you can put the information into a pandas dataframe. The logic would be to loop through each line of your file: if the line starts with '>', you store the line as an identifier. If not, you concatenate it to the current sequence value. Something like this:
import pandas as pd
testfile = '>NODE_1_length_4014_cov_1.97676\nAAAAAAAATTTTTTCCCCCCCGGGGGG\n>NODE_2518_length_56_cov_219\nAAAAAAAAGCCCTTTTT'.split('\n')
identifiers = []
sequences = []
current_sequence = ''
for line in testfile:
    if line.startswith('>'):
        # a new header closes the previous record, if there is one
        if identifiers:
            sequences.append(current_sequence)
        identifiers.append(line[1:])  # drop the leading '>'
        current_sequence = ''
    else:
        current_sequence += line.strip()
# the last record has no header after it, so store it here
if identifiers:
    sequences.append(current_sequence)
df = pd.DataFrame({'identifiers': identifiers,
                   'sequences': sequences})
Whether this code works depends on the details of your input which you didn't provide, but that might get you started.
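If the file is large, the same parsing logic can be written as a generator that works directly on the open file, so the records are produced one at a time. This is a sketch under the assumption that the file really is in the header/sequence layout shown above; the function name read_contigs and the shortened sample sequences are made up for the example.

```python
import io

def read_contigs(handle):
    """Yield (identifier, sequence) pairs from FASTA-style text lines."""
    identifier, chunks = None, []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header ends the previous record, if any.
            if identifier is not None:
                yield identifier, "".join(chunks)
            identifier, chunks = line[1:], []
        elif identifier is not None:
            chunks.append(line)
    # Emit the final record, which has no header after it.
    if identifier is not None:
        yield identifier, "".join(chunks)

# In-memory stand-in for the contig file, with shortened sequences.
sample = io.StringIO(
    ">NODE_1_length_4014_cov_1.97676\n"
    "AATTAATG\n"
    "CAAAGCCT\n"
    ">NODE_2518_length_56_cov_219\n"
    "CCCTTGTT\n"
)
records = list(read_contigs(sample))
for contig, sequence in records:
    print(contig, sequence)
```

The resulting list of pairs can be passed to pd.DataFrame(records, columns=['contig', 'sequence']) to get the two-column table from the question.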

Split the input file-format into a multiple lines list, interpolating number ranges "n-m"

I need help separating a CSV file into a list.
Here are the input file and the output that I need.
I have a CSV file which looks like this (line by line):
1-6
97
153,315,341,535
15,~1510,~1533,~1534,~1535,~1590
I need my output to be:
Col 1 Col 2
1 ~1510
2 ~1533
3 ~1534
4 ~1535
5 ~1590
6
97
153
315
341
535
15
Meaning that when I detect a "-" sign, for example 1-6, it should expand to 1 through 6,
and the numbers with and without "~" should go into 2 different columns.
However, the result I get with my code is as below:
Col1 Col2 Col3 Col4 Col5 Col6
6-Jan
97
153 315 341 535
15 ~1510 ~1533 ~1534 ~1535 ~1590
my code:
import csv
with open('testing.csv') as f, open("testing1.csv", "w") as outfile:
    writer = csv.writer(outfile)
    f.readline() # these are headings should remove them
    csv_reader = csv.reader(f, delimiter=",")
    for line_list in csv_reader:
        skills_list = [line_list[0].split(',')]
        for skill in skills_list:
            writer.writerow(skill)
Please help. Thanks a lot.
This is how I would do it. Read all the data first and construct your columns. Then iterate over the columns and build your CSV.
Here is the code for building the columns.
import csv
column_1 = []
column_2 = []
with open('testing.csv', 'r') as fin:
    for line in fin:
        items = line.split(',')
        for item in items:
            if '-' in item:
                num_range = item.split('-')
                # range() excludes its end, so add 1 to the end only
                column_1 += range(int(num_range[0]), int(num_range[1])+1)
            elif '~' in item:
                column_2.append(item.strip())
            else:
                column_1.append(item.strip())
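The second step, building the CSV from the two columns, might look roughly like this. It is a sketch: the column contents are typed in by hand to match what the column-building code is meant to produce for the sample input, and an in-memory buffer stands in for the real output file.

```python
import csv
import io
from itertools import zip_longest

# Columns as the column-building step is intended to produce them for the sample input.
column_1 = [1, 2, 3, 4, 5, 6, "97", "153", "315", "341", "535", "15"]
column_2 = ["~1510", "~1533", "~1534", "~1535", "~1590"]

# In-memory stand-in for open("testing1.csv", "w", newline="").
outfile = io.StringIO()
writer = csv.writer(outfile)
writer.writerow(["Col 1", "Col 2"])

# zip_longest pads the shorter column with empty strings.
for left, right in zip_longest(column_1, column_2, fillvalue=""):
    writer.writerow([left, right])

rows = outfile.getvalue().splitlines()
print("\n".join(rows))
```

With a plain zip the output would stop after the fifth row, because zip ends at the shorter column; zip_longest keeps going and fills the gaps.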
You cannot write output until you have read the required input. So the first output line can be written when you have obtained the input ~1510.
The simplest solution is to read the entire input file into memory, then write. I would maintain two lists, then push to the first if no tilde, otherwise to the other one. For output, then, simply loop over these lists, supplying empty values if one of them runs out.
If you need to optimize memory usage (e.g. if there is more input than you can fit into memory), maybe write each line as soon as you can and free up its memory; but this is more challenging to get right.
import itertools as it
results = {
    'col1': [],
    'col2': [],
}
with open('data.txt') as f:
    for line in f:
        line = line.rstrip()
        entries = line.split(",")
        for entry in entries:
            if entry.startswith('~'):
                column = 'col2'
                entry = entry[1:]
            else:
                column = 'col1'
            if '-' in entry:
                start, end = entry.split('-')
                results[column].extend(
                    list(range(int(start), int(end)+1))
                )
            else:
                results[column].append(entry)
print("{} \t {}".format('Col 1', 'Col 2'))
column_iter = it.zip_longest(
    results['col1'],
    ["~{}".format(num) for num in results['col2']],
    fillvalue=''
)
for col1_num, col2_num in column_iter:
    print(
        "{} \t {}".format(col1_num, col2_num)
    )
--output:--
Col 1 Col 2
1 ~1510
2 ~1533
3 ~1534
4 ~1535
5 ~1590
6
97
153
315
341
535
15
And with this data.txt:
1-6
~7-10,97
153,315,341,535
15,~1510,~1533,~1534,~1535,~1590
output:
Col 1 Col 2
1 ~7
2 ~8
3 ~9
4 ~10
5 ~1510
6 ~1533
97 ~1534
153 ~1535
315 ~1590
341
535
15
