Python pandas: verifying text blocks against rules - python

I am trying to do text verification: I have to verify whether each candidate text block is content or non-content.
The input for this program is a CSV file.
The candidate column shows the sequence number of each candidate text block.
So lines 82-87 are one text block, lines 111-116 another, lines 1552-1553 another, and so on. I want to check each candidate text block, and if a candidate fulfils one of the rules it will be used as the output.
The rules for verifying a candidate text block are:
The candidate must contain h1, the TC column must be > 0, and the LTC column must be < 0.
The number of TC in the text block must be greater than the TC threshold.
The number of TC in a text block means the sum of TC in that block; for example, in candidate 0 the number of TC is 0+1+5+7+4+0 = 17.
The TC threshold is 30.
If a candidate fulfils one of those rules it will be used as the output.
I then just want to present the Words column of the text block as the output and save it to a txt file.
So based on the rules, the output will be candidates 0 and 5.
My expected output is like:
UPDATE: MY PROGRAM
import pandas as pd
from listTV import get_filepaths_tv

filenames = get_filepaths_tv(r"C:\Users\firlyarmanda\PycharmProjects\EkstraksiBerita\TC_0.1.5")
index = 0
for f in filenames:
    file_html = open(str(f), "r")
    dataf = pd.read_csv(file_html)
    df = dataf.dropna()  # drop rows containing NaN
    candidate_groups = df.groupby('candidate')
    #f = open('textfile.txt', 'w')
    for _, group_df in candidate_groups:
        if group_df['TC'].sum() > 40 or (group_df['TAG'] == "['h1']").any() and (group_df['LTC'] == 0).all():
            a = '\n'.join(group_df['Words'].astype(str)) + '\n'
            #f.write('\n'.join(group_df['Words'].astype(str)) + '\n')
    #f.close()
    index += 1
    stored_file = "textverification/" + '{0:03}'.format(index) + ".txt"
    filewrite = open(stored_file, "w")
    filewrite.write(a)
    filewrite.close()
But I get the output in separate files. I want to join all the output and save it to a single text file.

It's not clear what your rules are precisely, but you can use groupby's filter method. First, define a function that checks if a group satisfies the conditions:
def rules(group):
    return (group['HTML'].str.contains('<h1>').any() and
            group['TC'].sum() > 0 and
            group['LTC'].sum() <= 0)
Then filter the dataframe:
result = df.groupby('candidate').filter(rules)
Lastly it's not clear how you want to print the text of selected candidates, but you can get the text of each candidate like this:
result.groupby('candidate')['Words'].apply(lambda w: '\n'.join(w))
This will join all the words in the 'Words' column by the newline character '\n'.
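To also write everything to a single txt file, as the question asks, here is a minimal sketch reusing the result frame from above (textfile.txt is the name from the question's commented-out code):
joined = result.groupby('candidate')['Words'].apply(lambda w: '\n'.join(w.astype(str)))
with open('textfile.txt', 'w') as out:
    out.write('\n'.join(joined) + '\n')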
Edit: After discussion, here is what worked for the asker (which includes some code provided in the other answer by user3712352).
candidate_groups = df.groupby('candidate')
f = open('textfile.txt', 'w')
for _, group_df in candidate_groups:
    if group_df['TC'].sum() > 30 and (group_df['TAG'] == "['h1']").any() and (group_df['LTC'] == 0).all():
        f.write('\n'.join(group_df['Words'].astype(str)) + '\n')
f.close()

After loading the csv:
import pandas as pd
df = pd.read_csv(INPUT_FILE)
A good start for this task would be grouping candidates' rows:
candidate_groups = df.groupby('candidate')
Then you can iterate over candidates and test the requirements:
def print_x(x):
    print(x)

for _, group_df in candidate_groups:
    if group_df['TC'].sum() > 30:  # 30 is the threshold
        if group_df[(group_df['TAG'] == "['h1']") & (group_df['LTC'] < 0) & (group_df['TC'] > 0)].shape[0] > 0:
            group_df['Words'].apply(print_x)  # print each word

Related

Read a file line by line, subtract one from each number, replace hyphens with colons, and print the output on one single line

I have a text file (s1.txt) containing the following information. Some lines contain only one number, and others contain two numbers separated by a hyphen.
1
3-5
10
11-13
111
113-150
1111
1123-1356
My objective is to write a program that reads the file line by line, subtracts one from each number, replaces hyphens with colons, and prints the output on one single line. The following is my expected outcome.
{0 2:4 9 10:12 110 112:149 1110 1122:1355}
Using the following code, I am receiving an output that is quite different from what I expected. Please let me know how I can correct it.
s1_file = input("Enter the name of the S1 file: ")
s1_text = open(s1_file, "r")
# Read contents of the S1 file to string
s1_data = s1_text.read()
for atoms in s1_data.split('\n'):
    if atoms.isnumeric():
        qm_atom = int(atoms) - 1
        #print(qm_atom)
    else:
        qm_atom = atoms.split('-')
    print(qm_atom)
If your goal is to output directly to the screen on a single line, you should add end=' ' to the print function.
Or you can store the values in a variable and print everything at the end.
Regardless of that, you were still missing the last steps: subtracting 1 from the values and then combining them with the join function. join is called on a string and builds a new string from the values of a list (all of which must be strings), separated by the string join was called on.
For example, ', '.join(['car', 'bike', 'truck']) gives 'car, bike, truck'.
s1_file = input("Enter the name of the S1 file: ")
s1_text = open(s1_file, "r")
# Read contents of the S1 file to string
s1_data = s1_text.read()
output = []
for atoms in s1_data.split('\n'):
    if not atoms:
        continue  # skip blank lines (e.g. from a trailing newline)
    if atoms.isnumeric():
        qm_atom = int(atoms) - 1
        output.append(str(qm_atom))
    else:
        qm_atom = atoms.split('-')
        # loop over the list to subtract 1 from each number
        qm_atom_subtracted = [str(int(q) - 1) for q in qm_atom]
        # join combines the numbers with ':'
        output.append(':'.join(qm_atom_subtracted))
print(output)
An alternative way of doing it could be:
s1_file = input("Enter the name of the S1 file: ")
with open(s1_file) as f:
    output_string = ""
    for line in f:
        elements = line.strip().split('-')
        elements = [int(element) - 1 for element in elements]
        elements = [str(element) for element in elements]
        elements = ":".join(elements)
        output_string += elements + " "
print(output_string)
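On the sample file above, this prints everything on one line: 0 2:4 9 10:12 110 112:149 1110 1122:1355 (with a trailing space).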
Why are you needlessly complicating a simple task by checking whether an element is numeric and then handling the two cases differently?
Your code also gave you a bad output because your else clause is incorrect: it just splits elements into sub-lists, and those sub-lists are never joined with ':'.
Anyway, here is my complete code:
f = open(s1_file, 'r')
t = f.readlines()  # read all lines
for i in range(0, len(t)):
    t[i] = t[i][0:-1]  # remove the trailing \n
    t[i] = t[i].replace('-', ':')  # replace - with :
    try:
        t[i] = int(t[i]) - 1  # convert str to int and subtract 1
    except ValueError:
        # range case: subtract 1 from both ends
        t[i] = f"{int(t[i].split(':')[0]) - 1}:{int(t[i].split(':')[1]) - 1}"
print(t)
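For comparison, the whole transformation fits in one comprehension; here is a minimal sketch, assuming the s1.txt file from the question:
with open("s1.txt") as f:
    # subtract 1 from every number, rejoining hyphenated ranges with ':'
    parts = [":".join(str(int(n) - 1) for n in token.split("-"))
             for token in f.read().split()]
print(" ".join(parts))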

How to count the changes made in a new csv file compared to the previous one

We have two csv files - new.csv and old.csv.
old.csv contains four rows:
abc done
xyz done
pqr done
rst pending
The new.csv contains four new rows:
abc pending
xyz not_done
pqr pending
rst done
I need to count two things without using pandas:
count1 = number of entries changed from done to pending = 2 (abc, pqr)
count2 = number of entries changed from done to not_done = 1 (xyz)
CASE 1: CSV Files are in the same order
Firstly import the two files into python lists:
oldcsv = []
with open("old.csv") as f:
    for line in f:
        oldcsv.append(line.strip().split(","))

newcsv = []
with open("new.csv") as f:
    for line in f:
        newcsv.append(line.strip().split(","))
Now you would simply iterate through both lists simultaneously, using zip(). I am assuming that both CSV files list the entries in the same order.
count1 = 0
count2 = 0
for oldentry, newentry in zip(oldcsv, newcsv):
    assert oldentry[0] == newentry[0]  # throw an error if entry names do not match
    if oldentry[1] == "done":
        if newentry[1] == "pending":
            count1 += 1
        elif newentry[1] == "not_done":
            count2 += 1
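On the sample data above, this yields count1 == 2 (abc, pqr) and count2 == 1 (xyz), matching the expected counts.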
CASE 2: CSV Files are in arbitrary order
Here, since you will need to look up entries by name, I would use a dictionary rather than a list to store the old.csv data, mapping entry names to their values:
# Load old.csv data into a dictionary mapping entry_name: entry_value
old_values = {}
with open("old.csv") as f:
    for line in f:
        old_entry = line.strip().split(",")
        entry_name, old_entry_value = old_entry[0], old_entry[1]
        old_values[entry_name] = old_entry_value

count1 = 0
count2 = 0
with open("new.csv") as f:
    for line in f:
        # For each entry in new.csv, look up the corresponding old entry and compare their values.
        new_entry = line.strip().split(",")
        entry_name, new_entry_value = new_entry[0], new_entry[1]
        old_entry_value = old_values.get(entry_name)  # None if there is no old entry
        # Essentially the same code as before:
        print(f"{entry_name}: old entry status is {old_entry_value} and new entry status is {new_entry_value}")
        if old_entry_value == "done":
            if new_entry_value == "pending":
                print("Incrementing count1")
                count1 += 1
            elif new_entry_value == "not_done":
                print("Incrementing count2")
                count2 += 1

print(count1)
print(count2)
This should work, as long as the input data is properly formatted. I am assuming each .csv file has one entry per line, and each line begins with the entry name (e.g. "abc"), then a comma, then the entry value (e.g. "done","not_done").
Here is a pure python straightforward implementation:
import csv

with open("old.csv") as old_fl:
    with open("new.csv") as new_fl:
        old = csv.reader(old_fl)
        new = csv.reader(new_fl)
        old_rows = [row for row in old]
        new_rows = [row for row in new]

# see if this is really needed
assert len(old_rows) == len(new_rows)
n = len(old_rows)

# assume that the left key is identical,
# and in the same order in both files
assert all(old_rows[i][0] == new_rows[i][0] for i in range(n))

# once the data is guaranteed to align,
# just count what you want
done_to_pending = [
    f"row[{i}]( {old_rows[i][0]} )"
    for i in range(n)
    if old_rows[i][1] == "done" and new_rows[i][1] == "pending"
]
done_to_notdone = [
    f"row[{i}]( {old_rows[i][0]} )"
    for i in range(n)
    if old_rows[i][1] == "done" and new_rows[i][1] == "not_done"
]
It uses Python's native csv reader, so you don't need to parse the csv yourself. Note that there are various assumptions (assert statements) throughout the code; you might need to adjust it to handle more cases.
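The requested counts are then simply the lengths of these lists:
count1 = len(done_to_pending)   # done -> pending
count2 = len(done_to_notdone)   # done -> not_done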

Why does my code not execute the for-loop?

I created a function and a for-loop, where the for-loop has no connection with the function. When the code is executed, the for-loop is skipped. I want to know why this is happening.
I have a TSV sentiment text file with the headers "positive word count", "negative word count", "total word count" and "text". A function is created to find the average word count, and a for-loop is used to categorize each sentence and append its "text" to the array whose name matches the categorization result. The code runs as intended when I do not call the function but instead integrate the calculation into the for-loop. However, I need to know why this is happening, as I may not be able to work around every problem like this one.
import csv

def get_sentiment_category(p_value, n_value):  # has more classifications; minimized here
    if p_value is not 0 and n_value is 0:
        return("positive")
    elif p_value is 0 and n_value is not 0:
        return("negative")

def get_average_word(data):
    total_word = 0
    total_sentence = 0
    for line in data:
        total_sentence += 1
        total_word = total_word + int(line[2])
    return (total_word/total_sentence)

list_of_category = ["positive","negative"]  ## Declare category names
positive = []
negative = []

with open(file, "r", encoding="utf-8-sig") as source:
    source_read = csv.reader(source, delimiter='\t')  # reading tsv
    next(source_read, None)  # skip header
    average_word = get_average_word(source_read)  ## getting average word count
    print("get avg")
    for line in source_read:
        print("1212")
        p_value = int(line[0])  ## positive value
        n_value = int(line[1])  ## negative value
        category = get_sentiment_category(p_value, n_value)
        globals()[category].append(line[3])  ## if category == "positive", line[3] (the text) is appended to the list named "positive"

print(average_word)
for item in list_of_category:
    print(globals()[item])
    print("-----------")
Sample data (tab-separated):
positive  negative  word_count  text
0         1         3           I am sad
1         0         4           I am very happy
Output should be:
get avg
1212
1212
3.5
["I am sad"]
-----------
["I am very happy"]
-----------
Instead:
get avg
3.5
[]
-----------
[]
-----------
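For what it's worth, csv.reader is a one-pass iterator, so get_average_word(source_read) consumes every row before the for-loop begins, which is why the loop body never runs. A minimal sketch of a fix (assuming the file variable and the functions from the question), materializing the rows so they can be traversed twice:
with open(file, "r", encoding="utf-8-sig") as source:
    source_read = csv.reader(source, delimiter='\t')
    next(source_read, None)   # skip header
    rows = list(source_read)  # materialize: a csv.reader can only be iterated once

average_word = get_average_word(rows)
for line in rows:  # rows is a list, so it can be iterated again
    category = get_sentiment_category(int(line[0]), int(line[1]))
    globals()[category].append(line[3])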

How to count number of rows altered by Series.str.replace?

I have a dataframe with a column Comments, and I use a regex to remove digits. I just want to count how many rows were altered by this pattern, i.e. to get a count of how many rows str.replace operated on.
df['Comments'] = df['Comments'].str.replace(r'\d+', '')
Output should look like
Operated on 10 rows
The re.subn() method returns a tuple of the new string and the number of replacements performed.
Example: text.txt contains the following lines of content.
No coments in the line 245
you can make colmments in line 200 and 300
Creating a list of lists with regular expressions in python ...Oct 28, 2018
re.sub on lists - python
Sample Code:
import re

count = 0
for line in open('text.txt'):
    if re.subn(r'\d+', "", line)[1] > 0:
        count += 1
print("operated on {} rows".format(count))
For pandas:
import pandas as pd

data = pd.DataFrame({'comments': list(open('text.txt', "r"))})
count = 0
for line in data['comments']:
    if re.subn(r'\d+', "", line)[1] > 0:
        count += 1
print("operated on {} rows".format(count))
Output:
operated on 3 rows
See if this helps:
import re

op_regex = re.compile(r"\d+")
df['op_count'] = df['comment'].apply(lambda x: len(op_regex.findall(x)))
print(f"Operation on {len(df[df['op_count'] > 0])} rows")
This uses findall, which returns a list of the matching strings.
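For a vectorized alternative, here is a minimal sketch using str.contains (assuming the Comments column and the digit pattern from the question) that counts the affected rows before replacing:
mask = df['Comments'].str.contains(r'\d+', na=False)  # rows the replace will touch
df['Comments'] = df['Comments'].str.replace(r'\d+', '', regex=True)
print("Operated on {} rows".format(mask.sum()))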

Comparing content in two csv files

So I have two csv files. Book1.csv has more data than similarities.csv, so I want to pull out the rows in Book1.csv that do not occur in similarities.csv. Here's what I have so far:
with open('Book1.csv', 'rb') as csvMasterForDiff:
    with open('similarities.csv', 'rb') as csvSlaveForDiff:
        masterReaderDiff = csv.reader(csvMasterForDiff)
        slaveReaderDiff = csv.reader(csvSlaveForDiff)
        testNotInCount = 0
        testInCount = 0
        for row in masterReaderDiff:
            if row not in slaveReaderDiff:
                testNotInCount = testNotInCount + 1
            else:
                testInCount = testInCount + 1
        print('Not in file: ' + str(testNotInCount))
        print('Exists in file: ' + str(testInCount))
However, the results are
Not in file: 2093
Exists in file: 0
I know this is incorrect because at least the first 16 entries in Book1.csv do not exist in similarities.csv, not all of them. What am I doing wrong?
A csv.reader object is an iterator, which means you can only iterate through it once. You should be using lists/sets for containment checking, e.g.:
# rows from csv.reader are lists (unhashable), so convert them to tuples first
slave_rows = set(map(tuple, slaveReaderDiff))
for row in masterReaderDiff:
    if tuple(row) not in slave_rows:
        testNotInCount += 1
    else:
        testInCount += 1
After converting them into sets, you can do a lot of helpful set operations without writing much code.
slave_rows = set(map(tuple, slaveReaderDiff))
master_rows = set(map(tuple, masterReaderDiff))

master_minus_slave_rows = master_rows - slave_rows  # rows only in Book1.csv
common_rows = master_rows & slave_rows  # rows present in both files

print('Not in file: ' + str(len(master_minus_slave_rows)))
print('Exists in file: ' + str(len(common_rows)))
Here are various set operations that you can do.
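For reference, a quick sketch of the common set operators on toy data:
a = {1, 2, 3}
b = {2, 3, 4}
print(a - b)  # difference: {1}
print(a & b)  # intersection: {2, 3}
print(a | b)  # union: {1, 2, 3, 4}
print(a ^ b)  # symmetric difference: {1, 4}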
