Python: calculate the sum of a column after splitting the column

I'm new to writing Python and thought I would rewrite some of my Perl programs.
I have a tab-delimited file where columns 9 through the end (the number of columns varies) need to be further split on ":", and then part of each split value summed per column.
For instance, input (only looking at columns 9-12):
0:0:1:0 0:0:2:0 0:0:3:0 0:0:4:0
0:0:1:0 0:0:2:0 0:0:3:0 0:0:4:0
0:0:1:0 0:0:2:0 0:0:3:0 0:0:4:0
0:0:1:0 0:0:2:0 0:0:3:0 0:0:4:0
Output (the sum of element [2] of each column):
4
8
12
16
All I've got so far is:
datacol = line.rstrip("\n").split("\t")
for element in datacol[9:len(datacol)]:
    splitcol = int(element.split(":")[2])
    totalcol += splitcol
print(totalcol)
which doesn't work: it gives me the sum of element [2] for each row rather than for each column.
Thanks

mysum = 0
with open('myfilename', 'r') as f:
    for line in f:
        mysum += int(line.split()[3])
line.split() will turn "123 Hammer 20 36" into ["123", "Hammer", "20", "36"].
We take the fourth value, "36", using the index [3]. This is still a string, and it can be converted to an integer using int or to a decimal (floating-point) number using float.
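As a quick illustration of that split-and-convert step (the sample string is the one from the sentence above):
parts = "123 Hammer 20 36".split()  # ['123', 'Hammer', '20', '36']
value = int(parts[3])               # 36 as an integer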
To skip empty lines, add the condition if line: inside the for loop. In your particular case you might do something like:
for line in f:
    words = line.split()
    if len(words) > 3:
        mysum += int(words[3])

Try this:
totalcol = [0, 0, 0, 0]  # store the sum results in a list
with open('myfilename', 'r') as f:
    for line in f:
        # split the line and keep columns 9-12
        # (assuming you are counting from 1 and have only 12 columns)
        datacol = line.rstrip("\n").split("\t")[8:]  # lists start at index 0!
        # loop through each column and sum its 3rd element
        for i, element in enumerate(datacol):
            splitcol = int(element.split(":")[2])
            totalcol[i] += splitcol
print(totalcol)
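If the number of columns after column 9 varies from row to row (as the question suggests), a minimal sketch that grows the totals list on the fly, under the same assumptions about the file format, could be:
totalcol = []
with open('myfilename', 'r') as f:
    for line in f:
        datacol = line.rstrip("\n").split("\t")[8:]
        for i, element in enumerate(datacol):
            if i >= len(totalcol):
                totalcol.append(0)  # grow the list as new columns appear
            totalcol[i] += int(element.split(":")[2])
print(totalcol)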

Related

Read a file line by line, subtract each number from one, replace hyphens with colons, and print the output on one single line

I have a text file (s1.txt) containing the following information. There are some lines that contain only one number and others that contain two numbers separated by a hyphen.
1
3-5
10
11-13
111
113-150
1111
1123-1356
My objective is to write a program that reads the file line by line, subtracts each number from one, replaces hyphens with colons, and prints the output on one single line. The following is my expected outcome.
{0 2:4 9 10:12 110 112:149 1110 1122:1355}
Using the following code, I am receiving an output that is quite different from what I expected. Please, let me know how I can correct it.
s1_file = input("Enter the name of the S1 file: ")
s1_text = open(s1_file, "r")
# Read contents of the S1 file into a string
s1_data = s1_text.read()
for atoms in s1_data.split('\n'):
    if atoms.isnumeric():
        qm_atom = int(atoms) - 1
        #print(qm_atom)
    else:
        qm_atom = atoms.split('-')
    print(qm_atom)
If your goal is to print everything directly to the screen on a single line, you should add end=' ' to the print function.
Or you can store the values in a variable and print everything at the end.
Regardless of that, at the end you were missing two things: subtracting 1 from the values, and joining them back together with the join function. join is called on a string and builds a new string from the values of a list (all values must be strings), separated by the string it is called on.
For example, ', '.join(['car', 'bike', 'truck']) gives 'car, bike, truck'.
s1_file = input("Enter the name of the S1 file: ")
s1_text = open(s1_file, "r")
# Read contents of the S1 file into a string
s1_data = s1_text.read()
output = []
for atoms in s1_data.split('\n'):
    if atoms.isnumeric():
        qm_atom = int(atoms) - 1
        output.append(str(qm_atom))
    else:
        qm_atom = atoms.split('-')
        # loop over the list to subtract 1 from each number
        qm_atom_substrated = [str(int(q) - 1) for q in qm_atom]
        # join to combine the numbers with ':'
        output.append(':'.join(qm_atom_substrated))
print(output)
An alternative way of doing it could be:
s1_file = input("Enter the name of the S1 file: ")
with open(s1_file) as f:
    output_string = ""
    for line in f:
        elements = line.strip().split('-')
        elements = [int(element) - 1 for element in elements]
        elements = [str(element) for element in elements]
        elements = ":".join(elements)
        output_string += elements + " "
print(output_string)
Why are you needlessly complicating a simple task by checking whether an element is numeric and then handling the two cases differently?
Also, your code gave you a bad output because your else clause is incomplete: it just splits each element into a sub-list, and that sub-list is never joined back together with ':'.
Anyway, here is my complete code:
f = open(s1_file, 'r')
t = f.readlines()  # read all lines
for i in range(0, len(t)):
    t[i] = t[i][0:-1]              # remove the trailing \n
    t[i] = t[i].replace('-', ':')  # replace - with :
    try:
        t[i] = int(t[i]) - 1       # single number: convert to int and subtract 1
    except ValueError:
        # range case: subtract 1 from both endpoints
        t[i] = f"{int(t[i].split(':')[0]) - 1}:{int(t[i].split(':')[1]) - 1}"
print(t)

Finding identical numbers in large files in Python

I have two data files in python, each containing two-column data as below:
3023084 5764
9152549 5812
18461998 5808
45553152 5808
74141469 5753
106932238 5830
112230478 5795
135207137 5800
148813978 5802
154818883 5798
There are about 10M entries in each file (~400 MB).
I have to sort through each file and check whether any number in the first column of one file matches any number in the first column of the other file.
The code I currently have converts the files to lists:
ch1 = []
with open('ch1.txt', 'r+') as file:
    for line in file:
        if ':' not in line:
            line = line.split()
            ch1.append([line[0], line[1]])
ch2 = []
with open('ch2.txt', 'r+') as file:
    for line in file:
        if ':' not in line:
            line = line.split()
            ch2.append([line[0], line[1]])
I then iterate through both of the lists looking for a match. When a match is found, I wish to append the sum of the right-hand columns to a new list, coin:
coin = []
for item1 in ch1:
    for item2 in ch2:
        if item1[0] == item2[0]:
            coin.append(int(item1[1]) + int(item2[1]))
The issue is that this takes a very long time and/or crashes. Is there a more efficient way of doing this?
There are lots of ways to improve this; for example:
Since you only scan through the contents of ch1.txt once, you don't need to read it into a list; that will use less memory, but probably won't speed things up all that much.
If you sort each of your lists, you can check for matches much more efficiently. Something like:
i1, i2 = 0, 0
while i1 < len(ch1) and i2 < len(ch2):
    if ch1[i1][0] == ch2[i2][0]:
        # Do what you do for matches
        ...
        # Advance both indices
        i1 += 1
        i2 += 1
    elif ch1[i1][0] < ch2[i2][0]:
        # Advance the index of the smaller value
        i1 += 1
    else:  # ch1[i1][0] > ch2[i2][0]
        i2 += 1
If the data in the files are already sorted, you can combine both ideas: instead of advancing an index, you simply read in the next line of the corresponding file. This should improve efficiency in time and space.
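A minimal sketch of that file-streaming merge, assuming both files are already sorted numerically on the first column and every line is a plain "number value" pair (filenames taken from the question):
coin = []
with open('ch1.txt') as f1, open('ch2.txt') as f2:
    line1, line2 = f1.readline(), f2.readline()
    while line1 and line2:
        k1, v1 = line1.split()
        k2, v2 = line2.split()
        if int(k1) == int(k2):
            coin.append(int(v1) + int(v2))
            line1, line2 = f1.readline(), f2.readline()
        elif int(k1) < int(k2):
            line1 = f1.readline()
        else:
            line2 = f2.readline()
print(len(coin), "matches")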
A few ideas to improve this:
store your data in dictionaries so that the first column is the key and the second column is the value, for later use;
a match is then a key that lies in the intersection of the two dictionaries' keys.
Code example:
# store your data in dicts as follows
ch1_dict[line[0]] = line[1]
ch2_dict[line[0]] = line[1]
# this is what you want to achieve
coin = [int(ch1_dict[key]) + int(ch2_dict[key]) for key in ch1_dict.keys() & ch2_dict.keys()]
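Put together with the file reading from the question, a runnable sketch of the dictionary approach (filenames and the ':' filter taken from the question) might look like:
def load(path):
    data = {}
    with open(path) as f:
        for line in f:
            if ':' not in line and line.strip():
                key, value = line.split()
                data[key] = value
    return data

ch1_dict = load('ch1.txt')
ch2_dict = load('ch2.txt')
# keys present in both files; dictionary lookups make the intersection fast
coin = [int(ch1_dict[key]) + int(ch2_dict[key]) for key in ch1_dict.keys() & ch2_dict.keys()]
print(len(coin), "matches")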

How to convert list to 20 column array [duplicate]

This question already has answers here: How to read a text file into a list or an array with Python (6 answers). Closed 3 years ago.
I am making a program that takes a text file that looks something like this:
1
0
1
1
1
and converts it into a list:
['1','0','1','1','1']
The file has 400 lines so I want to convert it into an array that's 20 columns by 20 rows.
Just use slicing to chunk it every 20 entries:
lines = [*range(1, 401)]
rows_cols = [lines[i:i + 20] for i in range(0, len(lines), 20)]
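Applied to the lines read from the file (the filename here is only an assumption for illustration), the same slicing could look like:
with open('data.txt') as f:
    lines = [line.strip() for line in f]
# 400 lines -> 20 rows of 20 columns
rows_cols = [lines[i:i + 20] for i in range(0, len(lines), 20)]
print(rows_cols)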
Detect the characters one by one, counting at the same time how many characters you have detected. There are two cases: one where you detect a regular character and the counter is less than 20, and the other where you detect the newline character, for which you don't update the counter. In the first case the detected character should be appended to the list (updating the column variable at the same time), while in the other case you just skip the newline and continue with the next character of the text file. When the counter reaches 20, you simply update the variable that represents the rows of the list.
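A minimal sketch of that character-by-character approach (the filename is assumed for illustration):
rows = []
current = []
with open('input.txt') as f:
    for ch in f.read():
        if ch == '\n':
            continue  # newlines are skipped and not counted
        current.append(ch)
        if len(current) == 20:  # a full row of 20 columns
            rows.append(current)
            current = []
if current:  # leftover characters if the total isn't a multiple of 20
    rows.append(current)
print(rows)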
This saves the lines in chunks of 20; if the number of lines is not a multiple of 20, the last chunk will have fewer than 20 elements and is still added to the main list:
solu = []
leng = 20
with open('result.txt', 'r') as f:
    sol = f.readlines()
    tmp = []
    for i in sol:
        if len(tmp) < leng:
            tmp.append(i.strip('\n'))
        else:
            print(tmp)
            solu.append(tmp)
            tmp = [i.strip('\n')]  # start the next chunk with the current line
    solu.append(tmp)
print(solu)

Number not Printing in python when returning amount

I have some code which reads from a text file and is meant to print the max and min altitudes, but the min altitude is not printing and there are no errors.
altitude = open("Altitude.txt", "r")
read = altitude.readlines()
count = 0
for line in read:
    count += 1
count = count - 1
print("Number of Different Altitudes: ", count)

def maxAlt(read):
    maxA = max(read)
    return maxA

def minAlt(read):
    minA = min(read)
    return minA

print()
print("Max Altitude:", maxAlt(read))
print("Min Altitude:", minAlt(read))
altitude.close()
I will include the Altitude text file if it is needed. Once again, the minimum altitude is not printing.
I'm assuming your file contains numbers and line breaks (\n).
You are reading it here:
read = altitude.readlines()
At this point read is a list of strings.
Now, when you do:
minA = (min(read))
It's trying to get "the smallest string in read". Strings are compared lexicographically (character by character), not numerically, so the smallest string is usually an empty or blank line - which most probably exists at the end of your file.
So your minAlt is actually getting printed. But it happens to be the empty string.
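A quick demonstration of that string comparison (the sample values are made up for illustration):
lines = ["500\n", "1000\n", "\n"]  # what readlines() might return
print(min(lines))                  # prints the blank line: "\n" sorts before any digit
print(min(["1000\n", "500\n"]))    # prints 1000, not 500: "1" sorts before "5"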
You can fix it by parsing the lines you read into numbers.
read = [float(a) for a in altitude.readlines() if a.strip()]
Try the solution below:
altitudeFile = open("Altitude.txt", "r")
Altitudes = [float(line) for line in altitudeFile if line.strip()]  # read the file data into a list, skipping blank lines
Max_Altitude = max(Altitudes)
Min_Altitude = min(Altitudes)
altitudeFile.close()
Change your code to this:
with open('numbers.txt') as nums:
    lines = nums.read().splitlines()
results = list(map(int, lines))
print(results)
print(max(results))
The first two lines read the file and store it as a list. The third line converts the list of strings to integers, and the last one searches the list and returns the max; use min for the minimum.

How to count number of rows altered by Series.str.replace?

I have a dataframe with a column Comments, and I use a regex to remove digits. I just want to count how many rows were altered by this pattern, i.e. to get a count of how many rows str.replace operated on.
df['Comments'] = df['Comments'].str.replace('\d+', '')
Output should look like
Operated on 10 rows
The re.subn() method returns the new string along with the number of replacements performed.
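A quick standalone demonstration (the sample string is made up for illustration):
import re

new_text, n = re.subn(r'\d+', '', 'room 101, floor 3')
print(new_text)  # 'room , floor '
print(n)         # 2 replacements were made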
Example: text.txt contains the following lines of content.
No coments in the line 245
you can make colmments in line 200 and 300
Creating a list of lists with regular expressions in python ...Oct 28, 2018
re.sub on lists - python
Sample Code:
import re

count = 0
for line in open('text.txt'):
    if re.subn(r'\d+', "", line)[1] > 0:
        count += 1
print("operated on {} rows".format(count))
For pandas:
data['comments'] = pd.DataFrame(open('text.txt', "r"))
count = 0
for line in data['comments']:
    if re.subn(r'\d+', "", line)[1] > 0:
        count += 1
print("operated on {} rows".format(count))
Output:
operated on 3 rows
See if this helps:
import re

op_regex = re.compile(r"\d+")
df['op_count'] = df['Comments'].apply(lambda x: len(op_regex.findall(x)))
print(f"Operation on {len(df[df['op_count'] > 0])} rows")
This uses findall, which returns the list of matching strings for each comment.
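An alternative sketch that counts the affected rows with a vectorized check before doing the replacement (the sample data below is made up for illustration; str.contains and str.replace are standard pandas string methods):
import pandas as pd

# hypothetical sample data
df = pd.DataFrame({'Comments': ['room 101', 'no digits here', 'call 555 1234']})

mask = df['Comments'].str.contains(r'\d+', regex=True)  # True for rows containing digits
df['Comments'] = df['Comments'].str.replace(r'\d+', '', regex=True)
print("Operated on {} rows".format(mask.sum()))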
