I'm starting some text analysis on CSV documents. However, my CSV document has several sentences with few words which do not interest me, so I want to write Python code that analyzes the CSV document and keeps only the sentences that contain more than 5 words for my analysis. I don't know where to start with my code and would like some help.
Example: an input document and the desired output document (shown as screenshots in the original post, not reproduced here).
This should work (with Python 3.5):
import csv

finalLines = []
toRemove = ['a', 'in', 'the']

with open('export.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        if not row:
            continue
        # drop the uninteresting words from the sentence in the first column
        words = [word for word in row[0].split() if word not in toRemove]
        finalLines.append(' '.join(words))

print(finalLines)
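Note that this only drops the short stop words; if the goal is specifically to keep sentences with more than 5 words, one extra filter on the result would do it:
# keep only the sentences that still contain more than 5 words
finalLines = [s for s in finalLines if len(s.split()) > 5]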
You can get your work done efficiently and with ease if you use pandas (a Python library widely used for data manipulation). Here is the link to the official pandas documentation:
http://pandas.pydata.org/pandas-docs/stable/
Note: pandas has a built-in function for reading CSV files (read_csv). You can use its 'skiprows' parameter to skip rows you don't want, or apply a regex to filter the text.
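For instance, a minimal sketch with pandas (assuming the sentences sit in the first column of export.csv; filtered.csv is a made-up output name):
import pandas as pd

df = pd.read_csv('export.csv', header=None, names=['sentence'])

# keep only the rows whose sentence contains more than 5 words
df = df[df['sentence'].str.split().str.len() > 5]

df.to_csv('filtered.csv', index=False, header=False)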
Related
I found this post looking for a way to identify and clean abbreviations within my dataframe. The code works well for my use case.
However, I'm dealing with a large data set and was wondering if there is a better or more efficient way to apply this without running into memory issues.
To run the code snippet, I sampled 10% of the original dataset and it runs perfectly. If I run the full dataset, my laptop locks up.
Below is an updated version of the original code:
import spacy
from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_web_sm")
nlp.max_length = 43793966
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)

# `train` is the DataFrame holding the raw text column
text = [nlp(text, disable=['ner', 'parser', 'tagger']) for text in train.text]
text = ' '.join([str(elem) for elem in text])
doc = nlp(text)

# Print the abbreviation and its definition
print("Abbreviation", "\t", "Definition")
for abrv in doc._.abbreviations:
    print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")
How can I extract only the text from paragraphs and tables in a Word document using a Python module, when the document contains objects like hyperlinks, images, and attached Excel sheets?
I tried docx2python, but it only works for simple .docx files, not for ones that have links or an Excel file attached inside them.
Would this work?
import docx

doc = docx.Document(FILEPATH)
text = []
for paragraph in doc.paragraphs:
    line = [run.text for run in paragraph.runs]
    if line:
        # If you need a list of paragraphs:
        # text.append(line)
        result = ''.join(line)
        # Printing out final results
        print(result)
Also maybe for reading tables in documents you can use this:
https://github.com/gressa-cpu/Python-Code-to-Share/blob/main/read_word_table.py
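Tables can also be read directly through python-docx's own table API; a minimal sketch, assuming `doc` is the Document opened above:
# walk every table in the document and print its cell text, tab-separated
for table in doc.tables:
    for row in table.rows:
        print('\t'.join(cell.text.strip() for cell in row.cells))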
I'm trying to create my own corpus out of a set of text files. However, I want to do some preprocessing on the text files before they get corpus-ized and I can't figure out how to do that, short of creating a script to run through every single text file first, do the text preprocessing, save a new text file, and then make the corpus on the new, post-processed files. (This seems inefficient now, because I have ~200 mb of files that I would need to read through twice, and is not really scalable if I had a much larger corpus.)
The preprocessing that I want to do is very basic text manipulation:
Make every word as listed in the corpus lower case
Remove any items entirely enclosed in brackets, e.g., [coughing]
Remove the digits at the start of each line (they're line numbers from the original transcriptions); they are always the first four characters of the line
Critically, I want to do this preprocessing BEFORE the words enter the corpus - I don't want, e.g., "[coughing]" or "0001" as an entry in my corpus, and instead of "TREE" I want "tree."
I've got the basic corpus reader code, but the problem is that I can't figure out how to modify pattern matching as it reads in the files and builds the corpus. Is there a good way to do this?
corpusdir = "C:/corpus/"
newcorpus = PlaintextCorpusReader(corpusdir, '.*')
corpus_words = newcorpus.words() # get words in the corpus
fdist = nltk.FreqDist(corpus_words) # make frequency distribution of the words in the corpus
This answer seems sort of on the right track, but the relevant words are already in the corpus and the poster wants to ignore/strip punctuation before tokenizing the corpus. I want to affect which types of words are even entered (i.e., counted) in the corpus at all.
Thanks in advance!
I disagree with your inefficiency comment because once the corpus has been processed, you can analyze the processed corpus multiple times without having to run a cleaning function each time. That being said, if you are going to be running this multiple times, maybe you would want to find a quicker option.
As far as I can understand, PlaintextCorpusReader needs files as an input. I used code from Alvas' answer on another question to build this response. See Alvas' fantastic answer on using PlaintextCorpusReader here.
Here's my workflow:
from glob import glob
import re
import os
from nltk.corpus import PlaintextCorpusReader
from nltk.probability import FreqDist

mycorpusdir = glob('path/to/your/corpus/*')

# captures bracketed text such as [coughing]
re_brackets = r'(\[.*?\])'
# exactly four digits at the start of a line (the transcription line numbers)
re_numbers = r'^(\d{4})'
Lowercase everything, strip the bracketed text, and remove the line numbers:
corpus = []
for path in mycorpusdir:
    with open(path) as f:
        raw = f.read()
    # lowercase everything
    all_lower = raw.lower()
    # remove bracketed text
    no_brackets = re.sub(re_brackets, '', all_lower)
    # remove the #### line numbers at the start of each line
    just_words = re.sub(re_numbers, '', no_brackets, flags=re.MULTILINE)
    corpus.append(just_words)
Make new directory for the processed corpus:
corpusdir = 'newcorpus/'
if not os.path.isdir(corpusdir):
os.mkdir(corpusdir)
# Output the files into the directory.
filename = 0
for text in corpus:
with open(corpusdir + str(filename) + '.txt' , 'w+') as fout:
print(text, file=fout)
filename += 1
Call PlaintextCorpusReader:
newcorpus = PlaintextCorpusReader('newcorpus/', '.*')
corpus_words = newcorpus.words()
fdist = FreqDist(corpus_words)
print(fdist)
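print(fdist) only shows a short summary; to inspect the actual counts you can, for example, ask for the most frequent words:
# ten most common words in the cleaned corpus
print(fdist.most_common(10))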
I have a question -- Right now I have code that imports a CSV file where the first column is full of words in the following format:
This
Is
The
Format
Once this CSV file is uploaded and read by Python, I want to be able to tag these words using an NLTK POS Tagger. Right now, my code goes like this
import csv

with open(r'C:\Users\jkk\Desktop\python.csv', 'r') as f:
    reader = csv.reader(f)
    J = []
    for row in reader:
        J.extend(row)

import nltk
nltk.pos_tag(J)
print(J)
However, when I print it, I only get:
['This ', 'Is ', 'The', 'Format']
without the POS Tag!
I'm not sure why this isn't working as I am very new to Python 3. Any help would be much appreciated! Thank you!
pos_tag creates and returns a new list; it doesn't modify its argument. Assign the new list back to the same name:
J = nltk.pos_tag(J)
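With that change, the last lines of your snippet become:
J = nltk.pos_tag(J)
print(J)   # each entry is now a (word, tag) tuple instead of a bare string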
Need some help with it! Sorry if it sounds stupid.
I am new to Python and want to try this example....
But the labeling was done manually, which is hard work if I have two .txt files (pos and neg), each with 1000 tweets.
Using the example above, how can I use it with text files?
If I understood correctly, you need to figure out a way of reading a text file into a Python object.
Considering you have two text files that contain positive and negative samples (pos.txt and neg.txt) with one tweet per line:
train_samples = {}
with open('pos.txt', 'rt') as f:   # file() is Python 2 only; open() works in both
    for line in f:
        # one tweet per line; strip the newline before using it as a key
        train_samples[line.strip()] = 'pos'
Repeat the above loop for negative tweets and you are done populating your train_samples.
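To avoid duplicating the loop, a small (hypothetical) helper could cover both files:
def load_samples(path, label, samples=None):
    # read one tweet per line from `path` and tag each with `label`
    samples = {} if samples is None else samples
    with open(path, 'rt') as f:
        for line in f:
            samples[line.strip()] = label
    return samples

train_samples = load_samples('pos.txt', 'pos')
train_samples = load_samples('neg.txt', 'neg', train_samples)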
You should look at the genfromtxt function from the numpy package: http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html
It returns an array, given the right parameters (delimiter, newline character, ...).
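A rough sketch of that idea (assuming one tweet per line in pos.txt; note that genfromtxt's default comments='#' would cut off hashtags, so you may want to change that):
import numpy as np

# read each whole line of the file as one string entry
tweets = np.genfromtxt('pos.txt', dtype=str, delimiter='\n')
print(tweets.shape)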