How to count number of rows altered by Series.str.replace? - python

I have a dataframe with a column Comments, and I use a regex to remove digits from it. I want to count how many rows were altered by this pattern, i.e. how many rows str.replace actually changed.
df['Comments'] = df['Comments'].str.replace(r'\d+', '', regex=True)
Output should look like
Operated on 10 rows

The re.subn() function returns a tuple containing the new string and the number of replacements performed.
Example: text.txt contains the following lines:
No coments in the line 245
you can make colmments in line 200 and 300
Sample Code:
import re
count = 0
with open('text.txt') as f:
    for line in f:
        # subn returns (new_string, number_of_substitutions)
        if re.subn(r'\d+', "", line)[1] > 0:
            count += 1
print("operated on {} rows".format(count))
For pandas:
import re
import pandas as pd
with open('text.txt') as f:
    data = pd.DataFrame({'comments': f.read().splitlines()})
count = 0
for line in data['comments']:
    if re.subn(r'\d+', "", line)[1] > 0:
        count += 1
print("operated on {} rows".format(count))
Output:
operated on 3 rows

See if this helps
import re
op_regex = re.compile(r"\d+")
df['op_count'] = df['Comments'].apply(lambda x: len(op_regex.findall(x)))
print(f"Operation on {len(df[df['op_count'] > 0])} rows")
This uses findall, which returns a list of the matching strings, so any row with at least one match is a row that the replacement would alter.
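The same count can probably be had without apply, by asking pandas which rows contain a match before replacing. This is a minimal sketch, assuming the column holds strings; the sample rows are made up and only the column name comes from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
df = pd.DataFrame({"Comments": ["no digits here", "line 245", "lines 200 and 300"]})

pattern = r"\d+"
# With an empty replacement, a row changes exactly when the pattern matches in it.
altered = df["Comments"].str.contains(pattern, regex=True, na=False)
rows_changed = int(altered.sum())

df["Comments"] = df["Comments"].str.replace(pattern, "", regex=True)
print(f"Operated on {rows_changed} rows")
```

For a replacement that could reproduce the matched text, comparing the column before and after the replace would be the safer test.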

Related

counting the unique words in a text file

Some of the unique words in the text file are not counted, and I have no idea what is wrong in my code.
file = open('tweets2.txt','r')
unique_count = 0
lines = file.readlines()
line = lines[3]
per_word = line.split()
for i in per_word:
    if line.count(i) == 1:
        unique_count = unique_count + 1
print(unique_count)
file.close()
Here is the text file:
"I love REDACTED and Fiesta and all but can REDACTED host more academic-related events besides strand days???"
The output of this code is:
16
The expected output for this line is:
17
"i will crack a raw egg on my head if REDACTED move the resumption of classes to Jan 7. im not even kidding."
The output of this code is:
20
The expected output for this line is:
23
If you want to count the number of unique whitespace-delimited tokens (case-sensitive) in the entire file, then:
with open('myfile.txt') as infile:
    print(len(set(infile.read().split())))
str.count() counts substrings, not whole words. Instead, use a set to drop the duplicated words:
per_word = set(line.split())
print(len(per_word))
You are counting each word as a substring of the whole line because you do:
for i in per_word:
    if line.count(i) == 1:
So some words are repeated as substrings, but not as words. For example, the first word is "i". line.count("i") gives 7 (it is also in "if", "im", etc.), so you don't count it as a unique word (even though it is). If you do:
for i in per_word:
    if per_word.count(i) == 1:
then you will count each word as a whole word and get the output you need.
Anyway, this is very inefficient (O(n^2)), because you iterate over each word and count then iterates over the whole list again. Either use a set, as suggested in other answers, or use a Counter:
from collections import Counter
unique_count = 0
line = "i will crack a raw egg on my head if REDACTED move the resumption of classes to Jan 7. im not even kidding."
per_word = line.split()
counter = Counter(per_word)
for count in counter.values():
    if count == 1:
        unique_count += 1
# Or simply
unique_count = sum(count == 1 for count in counter.values())
print(unique_count)
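The substring-versus-word distinction can be checked directly on the second example line. This is a minimal sketch; the variable names are chosen here for illustration only.

```python
from collections import Counter

line = ("i will crack a raw egg on my head if REDACTED move the resumption "
        "of classes to Jan 7. im not even kidding.")
words = line.split()

# str.count looks for substrings, so "i" is also found inside "will", "if", ...
substring_hits = line.count("i")
# list.count compares whole tokens, so "i" is found once.
token_hits = words.count("i")

# Number of tokens that occur exactly once in the line.
once_only = sum(1 for n in Counter(words).values() if n == 1)
print(substring_hits, token_hits, once_only)
```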

Searching for how many times a word occurs consecutively in a string w/ Python (PSET6 CS50)

My goal is to read some strings (parts of DNA in this context) from a CSV file, and then search another text file for how many times those strings occur consecutively in it. My current code creates an infinite loop (I wrote it that way because I could not come up with a proper condition for the while). Any help is appreciated, thanks.
My idea was: search for the target string; if it is found, search for it doubled, then tripled, and keep incrementing the count until the repeated string is no longer in the text that was read.
#Header line of csv : name,AGATC,AATG,TATC
# so checkstr = [AGATC,AATG,TATC]
#Example of searched strings `GCTAAATTTGTTCAGCCAGATGTAGGCTTACAAATCAAGCTGTCCGCTCGGCACGGCCTACACACGTCGTGTAACTACAACAGCTAGTTAATCTGGATATCACCATGACCGAATCATAGATTTCGCCTTAAGGAGCTTTACCATGGCTTGGGATCCAATACTAAGGGCTCGACCTAGGCGAATGAGTTTCAGGTTGGCAATCAGCAACGCTCGCCATCCGGACGACGGCTTACAGTTAGTAGCATAGTACGCGATTTTCGGGAAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGTATCTATCTATCTATCTATCT`
For example, it should be able to find how many times AGATC occurs consecutively in that string and return that count or store it in memory.
import csv
checkstr = [] #global array that tells us what str to read
def readtxt(csvfile,seq):
    with open(f'{csvfile}','r') as p:#finding which str to read from header line of the csv
        header = csv.reader(p)
        for row in header:
            checkstr = row[1:]
            break
    with open(f'{seq}','r') as f:#searching the text for strs
        readed = f.read()
    for j in checkstr:
        n = 1
        jnew = n * j
        while True:
            if jnew in readed:
                n += 1
                print(f"{jnew} and {n}")
                break
            else:
                break
This operates on the idea that splitting a string by a substring returns an empty string between consecutive occurrences of that substring. For example:
s = 'abbcd'
s.split('b')
['a', '', 'cd']
In this case the number of consecutive b in abbcd is the count of empty strings plus 1 (2 in this case).
Expanding on that, we can use itertools.groupby to group identical neighbouring items in the split result. Counting the items in a group of '' and adding one gives the number of consecutive occurrences. The try/except statement handles the case where no group of '' exists, for example when the substring is not in the string. Note that the code takes the first such group it finds, so it reports that run rather than necessarily the longest one.
from itertools import groupby
checkstr = ['AGATC', 'AATG', 'TATC']
s = 'GCTAAATTTGTTCAGCCAGATGTAGGCTTACAAATCAAGCTGTCCGCTCGGCACGGCCTACACACGTCGTGTAACTACAACAGCTAGTTAATCTGGATATCACCATGACCGAATCATAGATTTCGCCTTAAGGAGCTTTACCATGGCTTGGGATCCAATACTAAGGGCTCGACCTAGGCGAATGAGTTTCAGGTTGGCAATCAGCAACGCTCGCCATCCGGACGACGGCTTACAGTTAGTAGCATAGTACGCGATTTTCGGGAAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGTATCTATCTATCTATCTATCT'
for c in checkstr:
    groups = groupby(s.split(c))
    try:
        print(c,[sum(1 for _ in group)+1 for label, group in groups if label==''][0])
    except IndexError:
        print(c,0)
Output
AGATC 0
AATG 43
TATC 5
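If the longest run is what is needed, a direct scan is an alternative to the split trick. This is a minimal sketch, not the answer's method: longest_run and the short sample sequence are hypothetical names and data introduced here for illustration.

```python
def longest_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of motif in sequence."""
    if not motif:
        return 0
    step = len(motif)
    best = 0
    for start in range(len(sequence) - step + 1):
        count = 0
        pos = start
        # Walk forward one motif length at a time while copies keep matching.
        while sequence.startswith(motif, pos):
            count += 1
            pos += step
        best = max(best, count)
    return best

sample = "TTAGATCAGATCGGAATGAATGAATGCCAATG"  # hypothetical short sequence
for motif in ["AGATC", "AATG", "TATC"]:
    print(motif, longest_run(sample, motif))
```

Trying every start position costs more than a single pass, but it also handles motifs that overlap themselves, and it counts a single isolated occurrence as a run of 1.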

Number not Printing in python when returning amount

I have some code which reads from a text file and is meant to print the max and min altitudes, but the min altitude is not printing and there are no errors.
altitude = open("Altitude.txt","r")
read = altitude.readlines()
count = 0
for line in read:
    count += 1
count = count - 1
print("Number of Different Altitudes: ",count)
def maxAlt(read):
    maxA = (max(read))
    return maxA
def minAlt(read):
    minA = (min(read))
    return minA
print()
print("Max Altitude:",maxAlt(read))
print("Min Altitude:",minAlt(read))
altitude.close()
I will include the Altitude text file if it is needed. Once again, the minimum altitude is not printing.
I'm assuming your file contains numbers and line breaks (\n).
You are reading it here:
read = altitude.readlines()
At this point read is a list of strings.
Now, when you do:
minA = (min(read))
It's trying to get "the smallest string in read".
The smallest string is most likely a blank line ("\n"), which probably exists at the end of your file.
So your minAlt is actually getting printed, but it happens to be a blank line.
You can fix it by parsing the lines you read into numbers and skipping the blank ones.
read = [float(a) for a in altitude.readlines() if a.strip()]
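The effect of comparing strings instead of numbers can be reproduced without the real file. This is a minimal sketch; the contents of fake_file are invented to stand in for Altitude.txt and end with a blank line.

```python
import io

# Hypothetical file contents standing in for Altitude.txt, ending with a blank line.
fake_file = io.StringIO("120.5\n98\n1000\n\n")
lines = fake_file.readlines()

# Compared as strings, the blank line sorts first and "98" sorts last.
print(repr(min(lines)), repr(max(lines)))

# Parsed as numbers, with blank lines skipped, the ordering is numeric.
altitudes = [float(line) for line in lines if line.strip()]
print("Max Altitude:", max(altitudes))
print("Min Altitude:", min(altitudes))
```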
Try the solution below:
altitudeFile = open("Altitude.txt","r")
Altitudes = [float(line) for line in altitudeFile if line.strip()] #get file data into list, skipping blank lines.
Max_Altitude = max(Altitudes)
Min_Altitude = min(Altitudes)
altitudeFile.close()
Change your code to this:
with open('numbers.txt') as nums:
    lines = nums.read().splitlines()
results = list(map(int, lines))
print(results)
print(max(results))
The first two lines read the file and store it as a list of strings. The third line converts the strings to integers, and the last one returns the maximum of the list; use min for the minimum.

Remove text:u from strings in python

I am using the xlrd library to import values from an Excel file into a Python list.
I have a single column in the Excel file and am extracting the data row by row.
The problem is that the data I get in the list looks like this:
list = ["text:u'__string__'","text:u'__string__'",.....so on]
How can I remove this text:u prefix to get a plain list of strings?
Code here, using Python 2.7:
from xlrd import open_workbook
book = open_workbook("blabla.xlsx")
sheet = book.sheet_by_index(0)
documents = []
for row in range(1, 50): #start from 1, to leave out row 0
    documents.append(sheet.cell(row, 0)) #extract from first col
data = [str(r) for r in documents]
print data
Iterate over the items and remove the extra characters from each one:
s=[]
for x in list:
    s.append(x[7:-1]) # Slice from index 7 up to, but not including, the last character
If that's the standard input list you have, you can do it with a simple split:
[s.split("'")[1] for s in list]
# if the string itself contains "'" somewhere in between, a regex is the safer choice
import re
[re.findall(r"u'(.*)'", s)[0] for s in list]
#Output
#['__string__', '__string__']
I had the same problem. Following code helped me.
list = ["text:u'__string__'","text:u'__string__'",.....so on]
for index, item in enumerate(list):
    list[index] = list[index][7:] #Deletes first 7 characters
    list[index] = list[index][:-1] #Deletes last character
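The slicing and regex answers above can be combined into one pattern that is anchored on the prefix and on the closing quote. This is a minimal sketch in Python 3; the sample strings and the names raw, cell_pattern and cleaned are made up for illustration.

```python
import re

# Hypothetical sample of the stringified cells described in the question.
raw = ["text:u'first value'", 'text:u"it\'s quoted"']

# Match the literal prefix, an opening quote of either kind, and the same closing quote.
cell_pattern = re.compile(r"""^text:u?(['"])(.*)\1$""", re.DOTALL)

cleaned = []
for item in raw:
    match = cell_pattern.match(item)
    # Leave anything that does not look like a text cell untouched.
    cleaned.append(match.group(2) if match else item)

print(cleaned)
```

If the xlrd Cell objects are still at hand, reading each cell's value attribute instead of calling str() on the cell should avoid the prefix in the first place.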

Python: calculate the sum of a column after splitting the column

I'm new to writing Python and thought I would rewrite some of my programs that are in Perl.
I have a tab-delimited file where columns 9 through the end (the number varies) need to be split further, and then one part of each column summed.
For instance, input (only looking at columns 9-12):
0:0:1:0 0:0:2:0 0:0:3:0 0:0:4:0
0:0:1:0 0:0:2:0 0:0:3:0 0:0:4:0
0:0:1:0 0:0:2:0 0:0:3:0 0:0:4:0
0:0:1:0 0:0:2:0 0:0:3:0 0:0:4:0
output (sum of each column[2]):
4
8
12
16
All I've got so far is
datacol = line.rstrip("\n").split("\t")
for element in datacol[9:len(datacol)]:
    splitcol=int(element.split(r":")[2])
    totalcol += splitcol
print(totalcol)
which doesn't work and gives me the sum of column[2] for each row.
Thanks
mysum = 0
with open('myfilename','r') as f:
    for line in f:
        mysum += int(line.split()[3])
line.split() will turn "123 Hammer 20 36" into ["123", "Hammer", "20", "36"].
We take the fourth value 36 using the index [3]. This is still a string, and can be converted to an integer using int or a decimal (floating-point) number using float.
To check for empty lines, add the condition if line: in the for loop. In your particular case you might do something like:
for line in f:
    words = line.split()
    if len(words)>3:
        mysum += int(words[3])
Try this:
totalcol = [0,0,0,0] #Store sum results in a list
with open('myfilename','r') as f:
    for line in f:
        #split line and keep columns 9,10,11,12
        #assuming you are counting from 1 and have only 12 columns
        datacol = line.rstrip("\n").split("\t")[8:] #lists start at index 0!
        #Loop through each column and sum the 3rd element
        for i,element in enumerate(datacol):
            splitcol=int(element.split(":")[2])
            totalcol[i] += splitcol
print(totalcol)
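Since the question says the number of columns varies, the accumulator can be sized from the first row rather than fixed at four. This is a minimal sketch under that assumption; fake_file, first_data_col and totals are illustrative names, and the in-memory data mimics the sample input with eight leading columns.

```python
import io

# Hypothetical stand-in for the tab-delimited file: 8 leading columns, then the data columns.
row = "\t".join(["x"] * 8 + ["0:0:1:0", "0:0:2:0", "0:0:3:0", "0:0:4:0"])
fake_file = io.StringIO("\n".join([row] * 4) + "\n")

first_data_col = 8  # zero-based index of column 9
totals = []
for line in fake_file:
    fields = line.rstrip("\n").split("\t")[first_data_col:]
    if not totals:
        # Size the accumulator from the first row instead of hard-coding it.
        totals = [0] * len(fields)
    for i, field in enumerate(fields):
        # Add the third colon-separated part of each data column.
        totals[i] += int(field.split(":")[2])

print(totals)
```

This assumes every row has the same number of data columns as the first one; a longer later row would raise an IndexError.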
