I'm trying to read a text file and create a term document matrix using the textmining package. I can create the term document matrix by adding each line one by one, but I want to include the whole file at once. What am I missing in the following code? Thanks in advance for any suggestions.
import textmining

def term_document_matrix_roy_1():
    '''-----------------------------------------'''
    with open("data_set.txt") as f:
        reading_file_line = f.readlines()  # entire content, returned as a list of lines
    print reading_file_line  # the whole list
    reading_file_info = [item.rstrip('\n') for item in reading_file_line]
    print reading_file_info
    print reading_file_info[1]  # second line
    print reading_file_info[2]  # third line
    '''-----------------------------------------'''
    tdm = textmining.TermDocumentMatrix()
    #tdm.add_doc(reading_file_info)  # gives an error because add_doc expects a string, not the list from readlines
    tdm.add_doc(reading_file_info[0])
    tdm.add_doc(reading_file_info[1])
    tdm.add_doc(reading_file_info[2])
    for row in tdm.rows(cutoff=1):
        print row
Sample text file: "data_set.txt" contains the following lines:
Lets write some python code
Thus far, this book has mainly discussed the process of ad hoc retrieval.
Along the way we will study some important machine learning techniques.
The output should be a term document matrix, i.e. how many times each specific word appears in each document.
Output Image: http://postimg.org/image/eidddlkld/
If I'm understanding you correctly, you're currently adding each line of your file as a separate document. To add the whole file, you could just concatenate the lines, and add them all at once.
tdm = textmining.TermDocumentMatrix()
#tdm.add_doc(reading_file_info) #Giving error because of readlines
tdm.add_doc(' '.join(reading_file_info))
If you are looking for multiple matrices, you'll end up with only one row in each, as each matrix then holds only one document, unless you have another way of splitting the lines into separate documents. You may want to rethink whether this is what you actually want. Nevertheless, I think this code will do it for you:
with open("txt_files/input_data_set.txt") as f:
    tdms = []
    for line in f:
        tdm = textmining.TermDocumentMatrix()
        tdm.add_doc(line.strip())
        tdms.append(tdm)

for tdm in tdms:
    for row in tdm.rows(cutoff=1):
        print row
I haven't really been able to test this code, so the output might not be right. Hopefully it will get you on your way.
@Fred Thanks for the reply. I want to show it as in the image file. I can actually produce the same result using the following code, but I want each line as a separate matrix, not one matrix.
with open("txt_files/input_data_set.txt") as f:
    reading_file_info = f.read()  # read the exact content of the whole file as one string
tdm = textmining.TermDocumentMatrix()
tdm.add_doc(reading_file_info)
tdm.write_csv('txt_files/input_data_set_result.txt', cutoff=1)
for row in tdm.rows(cutoff=1):
    print row
What I'm trying to do is read a text file and create a term document matrix.
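For completeness, here is a small sketch (using the same textmining calls as above) that reads the file once and adds every line as its own document in a single matrix, which should reproduce the kind of matrix shown in the output image:

import textmining

tdm = textmining.TermDocumentMatrix()
with open("data_set.txt") as f:
    for line in f:
        line = line.strip()
        if line:                # skip empty lines
            tdm.add_doc(line)   # one document per line
for row in tdm.rows(cutoff=1):
    print row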
Related
I'm trying to get lines from a text file (.log) into a .txt document.
I need to get the same data into my .txt file, but the line itself is sometimes different. From what I have seen on the internet, this is usually done with a pattern that anticipates how the line is built.
1525:22Player 11 spawned with userinfo: \team\b\forcepowers\0-5-030310001013001131\ip\46.98.134.211:24806\rate\25000\snaps\40\cg_predictItems\1\char_color_blue\34\char_color_green\34\char_color_red\34\color1\65507\color2\14942463\color3\2949375\color4\2949375\handicap\100\jp\0\model\desann/default\name\Faybell\pbindicator\1\saber1\saber_malgus_broken\saber2\none\sex\male\ja_guid\420D990471FC7EB6B3EEA94045F739B7\teamoverlay\1
The line I'm working with usually looks like this. The data I'm trying to collect are:
\ip\0.0.0.0
\name\NickName_of_the_player
\ja_guid\420D990471FC7EB6B3EEA94045F739B7
And write these data into a .txt file. Here is my current code.
As explained above, I'm unsure what keyword to use for my research on Google, and what this could be called (because the string isn't always the same?).
I have been looking around a lot, and most of the tests I have done have allowed me to do some things, but I'm not yet able to do what I explained above. So I'm hoping for guidance here :) (Sorry if I'm noobish; I understand a lot about how it works, I just didn't learn the language in school. I mostly write small scripts, and usually they work fine, but this time it's way harder.)
def readLog(filename):
    with open(filename, 'r') as eventLog:
        data = eventLog.read()
        dataList = data.splitlines()
    return dataList
eventLog = readLog('games.log')
The backslashes in that line are literal characters, so if you paste the sample line into code for testing you need a raw string literal (r"...") so Python doesn't treat the backslashes as escape sequences; lines read from the file with open(filename, 'r') already contain them literally. To use your example, I ran
text_input = r"1525:22Player 11 spawned with userinfo: \team\b\forcepowers\0-5-030310001013001131\ip\46.98.134.211:24806\rate\25000\snaps\40\cg_predictItems\1\char_color_blue\34\char_color_green\34\char_color_red\34\color1\65507\color2\14942463\color3\2949375\color4\2949375\handicap\100\jp\0\model\desann/default\name\Faybell\pbindicator\1\saber1\saber_malgus_broken\saber2\none\sex\male\ja_guid\420D990471FC7EB6B3EEA94045F739B7\teamoverlay\1"
text_as_array = text_input.split('\\')
You'll need to know which columns contain the strings you care about. For example,
with open('output.dat', 'w') as fil:
    fil.write(text_as_array[6])
You can figure out these array positions from the sample string:
>>> text_as_array[6]
'46.98.134.211:24806'
>>> text_as_array[34]
'Faybell'
>>> text_as_array[44]
'420D990471FC7EB6B3EEA94045F739B7'
If the column positions are not consistent but the key-value pairs are always adjacent, we can leverage that:
>>> text_as_array.index("ip")
5
>>> text_as_array[text_as_array.index("ip")+1]
'46.98.134.211:24806'
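Building on that, here is a small sketch that pulls the fields of interest into a dict and writes them to a .txt file. It assumes the key/value fields always alternate after the "userinfo:" prefix, and the output file name players.txt is just a placeholder:

def extract_fields(line, keys=("ip", "name", "ja_guid")):
    # everything after the first backslash alternates: key, value, key, value, ...
    parts = line.split('\\')[1:]
    pairs = dict(zip(parts[::2], parts[1::2]))
    return {k: pairs.get(k) for k in keys}

with open('games.log') as log, open('players.txt', 'w') as out:
    for line in log:
        if 'spawned with userinfo:' in line:
            fields = extract_fields(line.rstrip('\n'))
            out.write('{ip} {name} {ja_guid}\n'.format(**fields))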
Objective
I'm trying to extract the GPS "Latitude" and "Longitude" data from a bunch of JPGs, and I have been successful so far, but my main problem is that when I write the coordinates to a text file, only one set of coordinates ends up in it, while my console output shows that every image was extracted. Here is an example: Console Output, and here is my text file that is supposed to mirror my console output: Text file
I don't fully understand what the problem is and why it only writes one of them instead of all. I believe it is being overwritten somehow, or the 'GPSPhoto' module is causing some issues.
Code
from glob import glob
from GPSPhoto import gpsphoto

# Scan jpg's that are located in the same directory.
data = glob("*.jpg")

# Scan contents of images and GPS values.
for x in data:
    data = gpsphoto.getGPSData(x)
    data = [data.get("Latitude"), data.get("Longitude")]
    print("\nsource: {}".format(x), "\n ↪ {}".format(data))

# Write coordinates to a text file.
with open('output.txt', 'w') as f:
    print('Coordinates:', data, file=f)
I have tried pretty much everything that I can think of including: changing the write permissions, not using glob, no loops, loops, lists, no lists, different ways to write to the file, etc.
Any help is appreciated because I am completely lost at this point. Thank you.
You're replacing the data variable each time through the loop, not appending to a list.
all_coords = []
for x in data:  # data is the list of jpg paths from glob
    gps = gpsphoto.getGPSData(x)  # use a new name so the file list isn't overwritten
    all_coords.append([gps.get("Latitude"), gps.get("Longitude")])

with open('output.txt', 'w') as f:
    print('Coordinates:', all_coords, file=f)
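If you also want the text file to mirror the console output, with one entry per image rather than one list at the end, a variant sketch along these lines (still using the same getGPSData call as above) should work:

from glob import glob
from GPSPhoto import gpsphoto

with open('output.txt', 'w') as f:
    for x in glob("*.jpg"):
        gps = gpsphoto.getGPSData(x)
        coords = [gps.get("Latitude"), gps.get("Longitude")]
        print("source: {}".format(x), file=f)
        print("  -> {}".format(coords), file=f)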
I have a lot of data files with unknown names. I have figured out a way to get them all read and printed, but I want to make graphs of them, so I need the data in a workable form.
The data files are very neatly arranged (every line of the header describes what is stored there), but I am having trouble making a script that selects the data I need. The first 50+ lines of the file contain headers, of which I only need a few; this poses no problem when using something like:
for filename in glob.glob(fullpath):
    with open(filename, 'r') as f:
        for line in f:
            if 'xx' in line:
                pass  # do my thing
            if 'yy' in line:
                pass  # do my thing, etc.
But below the headers there is a block of data with an undetermined number of columns and an undetermined number of lines (the number of columns, and what each column is, is specified in the headers). I can't get this read in a way that lets, for example, matplotlib make a graph of it. (I can get it right by manually copying the data to a separate file and reading that into a plottable format, but that is not what I want to do every time for every file...) The line before the data starts contains the very useful #eoh, but I can't figure out a way to combine the selective reading of the first 50+ lines with then switching to reading everything into an array. If there are methods to do what I want in a better way (including selecting the folder and seeing which files are there and readable), I am open to suggestions.
Update:
The solution proposed by @ImportanceOfBeingErnest seems very useful, but I can't get it to work.
So I'll start with the data mentioned as missing in the answer.
Column names are given in the following format:
#COLUMNINFO= NUMBER1, UNIT, MEASUREMENT, NUMBER2
In this format, NUMBER1 is the column number, UNIT is the unit of the measurement, MEASUREMENT is what is measured, and NUMBER2 is a numeric code for what is measured.
The data is separated by spaces but that won't be a problem, I suspect.
I tried to implement the reading of the headers in the loop that determines the end of the headers, but it had no visible effect; even the print commands I added to check intermediate results did not show anything.
Once I put 'print line' after 'for line in f:' I thought I could see what went wrong, but it appears as if the whole loop is ignored, including the break command, which causes an error since the file is already done reading and no data is left for the other parts...
Any help would be appreciated.
First of all, if the header has a certain character at the beginning of each line, this can be used to filter the header out automatically. Using numpy you could call numpy.loadtxt(filename, delimiter=";", comments="#") to load the data, and every line starting with # would simply be ignored.
I don't know if this is the case here?!
In the case that you describe, where you have a header-ending flag #eoh, you could first read the header line by line to find out how many lines you later need to skip, and then use that number when loading the file.
I have assembled a little example of how it could work.
import glob
import numpy as np
import matplotlib.pyplot as plt

def readFile(filename):
    # first find the number of lines to skip in the header
    eoh = 0
    with open(filename, "r") as f:
        for line in f:
            eoh = eoh + 1
            if "#eoh" in line:
                break
    # now at this point we would need to find out about the column names,
    # but as no example data is given, this is impossible
    columnnames = []
    # load the file by skipping eoh lines,
    # the rest should be well behaved
    a = np.genfromtxt(filename, skip_header=eoh, delimiter=";")
    return a, columnnames

def plot(a, columnnames, show=True, save=False, filename="something"):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # plot the fourth column against the second
    ax.plot(a[:, 1], a[:, 3])
    # if we had some column names, we could also plot
    # the column named "ulf" against the one named "alf"
    #ax.plot(a[:, columnnames.index("alf")], a[:, columnnames.index("ulf")])
    # now save and/or show
    if save:
        plt.savefig(filename + ".png")
    if show:
        plt.show()

if __name__ == "__main__":
    fullpath = "path/to/files/file*.txt"  # or whatever
    for filename in glob.glob(fullpath):
        a, columnnames = readFile(filename)
        plot(a, columnnames, show=True, save=False, filename=filename[:-4])
One remaining problem is the names of the columns. Since you did not provide any example data, it's hard to estimate exactly how to do that.
This all assumes that you do not have any missing data in between or anything of that kind. If that is the case, you'd need to use the appropriate arguments to numpy.genfromtxt() to filter the data accordingly.
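Given the #COLUMNINFO= header format described in the update, one way to fill the columnnames gap could look like the sketch below. The field order is taken from that description (number, unit, measurement, number), so treat it as an assumption; also note that with space-separated data the delimiter=";" in genfromtxt above would likely have to be dropped, since whitespace is the default:

def readColumnNames(filename):
    # collect the MEASUREMENT field of every "#COLUMNINFO=" line in the header
    columnnames = []
    with open(filename, "r") as f:
        for line in f:
            if "#eoh" in line:
                break
            if line.startswith("#COLUMNINFO="):
                # format: #COLUMNINFO= NUMBER1, UNIT, MEASUREMENT, NUMBER2
                fields = [p.strip() for p in line.split("=", 1)[1].split(",")]
                columnnames.append(fields[2])
    return columnnames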
I'm currently trying to make an automation script for writing new files from a master where there are two strings I want to replace (x1 and x2) with values from a 21 x 2 array of numbers (namely, [[0,1000],[50,950],[100,900],...,[1000,0]]). Additionally, with each double replacement, I want to save that change as a unique file.
Here's my script as it stands:
import numpy

lines = []
x1x2 = numpy.array([[0, 1000], [50, 950], [100, 900], ..., [1000, 0]])
for i, j in x1x2:
    with open("filenamexx.inp") as infile:
        for line in infile:
            linex1 = line.replace('x1', str(i))
            linex2 = line.replace('x2', str(j))
            lines.append(linex1)
            lines.append(linex2)
    with open("filename" + str(i) + str(j) + ".inp", 'w') as outfile:
        for line in lines:
            outfile.write(line)
With my current script there are a few problems. First, the string replacements are done separately, i.e. I end up with a new file that contains the contents of the master file twice, where one copy of each line has the first change and the next reflects the second change separately. Second, with each subsequent iteration, the new files have the contents of the previous files prepended (i.e. filename100900.inp contains its unique contents as well as the contents of both filename01000.inp and filename50950.inp before it). Anyone think they can take a crack at solving my problem?
Note: I've looked at using regex module solutions (something like this: https://www.safaribooksonline.com/library/view/python-cookbook-2nd/0596007973/ch01s19.html) in order to do multiple replacements in a single pass, but I'm not sure if the way I'm indexing is translatable to a dictionary object.
I'm not sure I fully understood the second issue, but for the first one: you can call replace more than once on the same string, so:
s = "x1 stuff x2"
s = s.replace('x1',str(1)).replace('x2',str(2))
print(s)
will output:
1 stuff 2
There is no need to do this twice for two different variables. As for the second issue, it seems you're not resetting the lines variable before starting to write a new file. So once you finish writing a file, just add:
lines = []
That should be enough to solve both issues.
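Putting both fixes together, the loop could look something like the sketch below (the array literal is abbreviated here just as it is in the question):

import numpy

x1x2 = numpy.array([[0, 1000], [50, 950], [100, 900], [1000, 0]])  # fill in the full 21 x 2 array
for i, j in x1x2:
    lines = []  # reset for every output file
    with open("filenamexx.inp") as infile:
        for line in infile:
            # both replacements applied to the same line, appended once
            lines.append(line.replace('x1', str(i)).replace('x2', str(j)))
    with open("filename" + str(i) + str(j) + ".inp", 'w') as outfile:
        outfile.writelines(lines)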
Need some help with this! Sorry if it sounds stupid.
I am new to Python and want to try this example...
but the labeling was done manually, which is hard work if I have two .txt files (pos and neg), each with 1000 tweets.
Using the example above, how can I use it with text files?
If I understood correctly, you need to figure out a way of reading a text file into a Python object.
Considering you have two text files that contain positive and negative samples (pos.txt and neg.txt) with one tweet per line:
train_samples = {}
with open('pos.txt', 'rt') as f:  # open() rather than the deprecated file()
    for line in f:
        train_samples[line.strip()] = 'pos'  # strip the trailing newline
Repeat the above loop for negative tweets and you are done populating your train_samples.
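Or, to handle both files in one loop (same file names as above):

train_samples = {}
for path, label in (('pos.txt', 'pos'), ('neg.txt', 'neg')):
    with open(path, 'rt') as f:
        for line in f:
            train_samples[line.strip()] = label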
You should look at the genfromtxt function from the numpy package: http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html
It returns a matrix, given the right parameters (delimiter, newline character, ...).
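For illustration, a minimal call might look like this (the file name, delimiter, and header count are placeholders, not taken from your data):

import numpy as np

# hypothetical file with ";"-separated numeric columns and one header line
data = np.genfromtxt("data.txt", delimiter=";", skip_header=1)
print(data.shape)  # (number of rows, number of columns)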