Need some help with this! Sorry if it sounds stupid.
I am new to Python and want to try this example...
but the labeling was done manually, which is hard work when I have two .txt files (pos and neg), each with 1000 tweets.
Using the example above, how can I use it with text files?
If I understood correctly, you need to figure out a way of reading a text file into a Python object.
Considering you have two text files that contain positive and negative samples (pos.txt and neg.txt), with one tweet per line:
train_samples = {}
with open('pos.txt', 'rt') as f:  # file() was Python 2 only; open() works everywhere
    for line in f:  # iterating the file object yields one line at a time
        train_samples[line.strip()] = 'pos'  # strip() drops the trailing newline
Repeat the above loop for negative tweets and you are done populating your train_samples.
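For reference, a compact sketch that populates both labels in one pass (same one-tweet-per-line assumption as above):
train_samples = {}
for path, label in (('pos.txt', 'pos'), ('neg.txt', 'neg')):
    with open(path) as f:
        for line in f:
            train_samples[line.strip()] = label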
You should look at the genfromtxt function from the numpy package: http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html
It returns a matrix, given the right parameters (delimiter, newline character, ...).
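For example, a minimal sketch (the filename and delimiter here are placeholders for your data, not something prescribed by numpy):
import numpy as np

# dtype=str keeps each field as text; delimiter='\t' assumes tab-separated columns
data = np.genfromtxt('data.txt', dtype=str, delimiter='\t')
print(data.shape)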
I'm trying to get lines from a text file (.log) into a .txt document.
I need to get the same data into my .txt file, but the line itself is sometimes different. From what I have seen on the internet, this is usually done with a pattern that anticipates how the line is built.
1525:22Player 11 spawned with userinfo: \team\b\forcepowers\0-5-030310001013001131\ip\46.98.134.211:24806\rate\25000\snaps\40\cg_predictItems\1\char_color_blue\34\char_color_green\34\char_color_red\34\color1\65507\color2\14942463\color3\2949375\color4\2949375\handicap\100\jp\0\model\desann/default\name\Faybell\pbindicator\1\saber1\saber_malgus_broken\saber2\none\sex\male\ja_guid\420D990471FC7EB6B3EEA94045F739B7\teamoverlay\1
The line I'm working with usually looks like this. The data I'm trying to collect are:
\ip\0.0.0.0
\name\NickName_of_the_player
\ja_guid\420D990471FC7EB6B3EEA94045F739B7
And print these data into a .txt file. Here is my current code.
As explained above, I'm unsure what keywords to use to research this on Google, or what this technique is called (since the string isn't always the same?).
I have been looking around a lot, and most of the tests I have done let me do some things, but I'm not yet able to do what I described above, so I'm hoping for guidance here :) (Sorry if I'm noobish; I understand a lot about how it works, I just never learned the language in school. I mostly write small scripts, and usually they work fine, but this time it's much harder.)
def readLog(filename):
    # Read the whole log and return it as a list of lines
    with open(filename, 'r') as eventLog:
        data = eventLog.read()
    dataList = data.splitlines()
    return dataList

eventLog = readLog('games.log')
One note on the backslashes: they are only treated as escape characters in Python source code, not in data read from disk, so a plain open(filename, 'r') hands them to you literally. The r prefix below is needed only because this example embeds your sample line as a string literal. To use your example, I ran
text_input = r"1525:22Player 11 spawned with userinfo: \team\b\forcepowers\0-5-030310001013001131\ip\46.98.134.211:24806\rate\25000\snaps\40\cg_predictItems\1\char_color_blue\34\char_color_green\34\char_color_red\34\color1\65507\color2\14942463\color3\2949375\color4\2949375\handicap\100\jp\0\model\desann/default\name\Faybell\pbindicator\1\saber1\saber_malgus_broken\saber2\none\sex\male\ja_guid\420D990471FC7EB6B3EEA94045F739B7\teamoverlay\1"
text_as_array = text_input.split('\\')
You'll need to know which columns contain the strings you care about. For example,
with open('output.dat', 'w') as fil:
    fil.write(text_as_array[6])
You can figure out these array positions from the sample string:
>>> text_as_array[6]
'46.98.134.211:24806'
>>> text_as_array[34]
'Faybell'
>>> text_as_array[44]
'420D990471FC7EB6B3EEA94045F739B7'
If the column positions are not consistent but the key-value pairs are always adjacent, we can leverage that
>>> text_as_array.index("ip")
5
>>> text_as_array[text_as_array.index("ip")+1]
'46.98.134.211:24806'
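If you'd rather not track positions at all, here is a minimal sketch (variable names are just for illustration) that exploits the alternating key/value layout to build a dict and write the three fields you listed:
fields = text_input.split('\\')
# fields[0] is the "1525:22Player 11 spawned with userinfo: " prefix;
# everything after it alternates key, value, key, value, ...
info = dict(zip(fields[1::2], fields[2::2]))

with open('output.txt', 'w') as out:
    for key in ('ip', 'name', 'ja_guid'):
        out.write('\\{}\\{}\n'.format(key, info[key]))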
Objective
I'm trying to extract the GPS "Latitude" and "Longitude" data from a bunch of JPGs, and I have been successful so far, but my main problem is that when I try to write the coordinates to a text file, only one set of coordinates is written, even though my console output shows that every image was processed. Here is an example: Console Output, and here is the text file that is supposed to mirror my console output: Text file
I don't fully understand what the problem is and why it won't write all of them instead of just one. I believe it is being overwritten somehow, or the 'GPSPhoto' module is causing some issue.
Code
from glob import glob
from GPSPhoto import gpsphoto

# Scan jpg's that are located in the same directory.
data = glob("*.jpg")

# Scan contents of images and GPS values.
for x in data:
    data = gpsphoto.getGPSData(x)
    data = [data.get("Latitude"), data.get("Longitude")]
    print("\nsource: {}".format(x), "\n ↪ {}".format(data))

# Write coordinates to a text file.
with open('output.txt', 'w') as f:
    print('Coordinates:', data, file=f)
I have tried pretty much everything that I can think of including: changing the write permissions, not using glob, no loops, loops, lists, no lists, different ways to write to the file, etc.
Any help is appreciated because I am completely lost at this point. Thank you.
You're replacing the data variable each time through the loop instead of appending to a list, so by the time you write the file, only the last image's coordinates are left.
all_coords = []
for x in data:
    gps = gpsphoto.getGPSData(x)  # renamed so the list being iterated isn't clobbered
    all_coords.append([gps.get("Latitude"), gps.get("Longitude")])

with open('output.txt', 'w') as f:
    print('Coordinates:', all_coords, file=f)
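If you want output.txt to mirror the console with one line per image, a small variation on the same idea (assuming getGPSData behaves as in your snippet):
from glob import glob
from GPSPhoto import gpsphoto

with open('output.txt', 'w') as f:
    for x in glob("*.jpg"):
        gps = gpsphoto.getGPSData(x)
        coords = [gps.get("Latitude"), gps.get("Longitude")]
        print("source: {} -> {}".format(x, coords), file=f)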
I need to extract the last number in the last line of each text file in a directory. Can someone get me started on this in Python? The data is formatted as follows:
# time 'A' 'B'
0.000000E+00 10000 0
1.000000E+05 7742 2263
where the '#' column is empty in each file. The filenames obey the following naming convention:
for i in `seq 1 100`; for j in `seq 1 101`; for letter in {A..D};
filename = $letter${j}_${i}.txt
These files contain the resulting data from running simulations in KaSim (Kappa language). I want to take the averages of subsets of the extracted numbers and plot some results.
Matlab can't handle the set of 50,000 files I'm dealing with. I'm relatively new to Python but I have experience in Matlab and R. I want to do the data extraction through Python and the analysis in Matlab or R.
Thanks for any help.
This code should get you started. As long as the directory contains only the files whose last number you need, the naming convention can be ignored: you can simply look at every file in that directory.
import glob

last_numbers = []
for filename in glob.glob("/path/to/directory/*"):  # the trailing * is a wildcard that matches every file
    with open(filename) as f:
        lines = f.readlines()
    # split() with no argument handles runs of spaces or tabs and drops the newline;
    # if the last line is blank ('\n'), use lines[-2] instead
    last_numbers.append(lines[-1].split()[-1])
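Since the end goal is averaging subsets, here is a hedged sketch of a follow-up step; grouping by the leading letter of the filename is only an assumption based on the $letter${j}_${i}.txt naming scheme:
import glob
import os
from collections import defaultdict

groups = defaultdict(list)
for filename in glob.glob("/path/to/directory/*.txt"):
    with open(filename) as f:
        value = float(f.readlines()[-1].split()[-1])
    letter = os.path.basename(filename)[0]  # e.g. 'A' from A1_1.txt (assumed grouping key)
    groups[letter].append(value)

for letter, values in sorted(groups.items()):
    print(letter, sum(values) / len(values))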
I must start by saying that I am very new to Python and still very bad at it, but I believe it will be worth learning eventually.
My problem is that I have a device that prints its values to a .txt file, but separated by tabs instead of commas, e.g. 50\t50\t66\t0\t4...
What I want is just to plot a simple histogram of that data.
I realise it should be the simplest thing, but somehow I'm having trouble with it: I can't find a solution in my Python nooby lectures, nor can I word this well enough to get search hits online.
import matplotlib.pyplot as plt
#import numpy as np
d = open('.txt', 'r')
d.read()
plt.hist(d)
plt.show()
PS: numpy is just a remnant from one of my previous exercises
No worries, everyone must start somewhere. You are on the right track, and you are correct that Python is a great language to learn. There are many ways this can be accomplished, but here is one. The way this example is written, it will generate one histogram per line in the file; you can modify that behavior if needed.
Please note that the csv module will take care of converting the data in the file to floats if you pass quoting=csv.QUOTE_NONNUMERIC to csv.reader. This is probably the preferred way of handling number conversion in a CSV / TSV file.
import csv
import matplotlib.pyplot as plt

data_file = open('testme.txt')
tsv_reader = csv.reader(data_file, delimiter='\t',
                        quoting=csv.QUOTE_NONNUMERIC)

for row in tsv_reader:
    plt.hist(row)
    plt.show()
I've left out some things, such as proper exception handling and using a context manager to open the file, which is best practice and demonstrated in the csv module documentation.
Once you learn more about the language, I'd suggest digging into those subjects further.
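For reference, here is a minimal variant of the loop above using a context manager (newline='' is the Python 3 recommendation from the csv documentation):
import csv
import matplotlib.pyplot as plt

with open('testme.txt', newline='') as data_file:
    tsv_reader = csv.reader(data_file, delimiter='\t',
                            quoting=csv.QUOTE_NONNUMERIC)
    for row in tsv_reader:
        plt.hist(row)
        plt.show()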
Assign the string result of read() to a variable s:
s = d.read()
split will break your string s into a list of strings:
s = s.split("\t")
map will apply a function to every element of a list:
s = map(float, s)
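Putting those three steps together, a minimal end-to-end sketch (the filename is a placeholder):
import matplotlib.pyplot as plt

with open('data.txt') as d:
    s = d.read()

# split() with no argument handles the tabs and swallows the trailing newline;
# a list comprehension works the same on Python 2 and 3, unlike bare map()
values = [float(x) for x in s.split()]

plt.hist(values)
plt.show()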
If you study csv, you can handle the file with delimiter='\t' as one of the options. This changes the expected delimiter from ',' to '\t' (tab); all the examples you study that use ',' can be handled the same way.
I'm trying to read one text file and create a term-document matrix using the textmining package. I can create a term-document matrix when I add each line one by one. The problem is that I want to include the whole file at once. What am I missing in the following code? Thanks in advance for any suggestions.
import textmining
def term_document_matrix_roy_1():
    '''-----------------------------------------'''
    with open("data_set.txt") as f:
        reading_file_line = f.readlines()  # entire content, returned as a list of lines
    print reading_file_line  # the whole list
    reading_file_info = [item.rstrip('\n') for item in reading_file_line]
    print reading_file_info
    print reading_file_info[1]  # line 2
    print reading_file_info[2]  # line 3
    '''-----------------------------------------'''
    tdm = textmining.TermDocumentMatrix()
    #tdm.add_doc(reading_file_info)  # fails: add_doc expects a string, not a list
    tdm.add_doc(reading_file_info[0])
    tdm.add_doc(reading_file_info[1])
    tdm.add_doc(reading_file_info[2])
    for row in tdm.rows(cutoff=1):
        print row
Sample text file: "data_set.txt" contains the following:
Lets write some python code
Thus far, this book has mainly discussed the process of ad hoc retrieval.
Along the way we will study some important machine learning techniques.
The output will be a term-document matrix: basically, how many times each specific word appears.
Output Image: http://postimg.org/image/eidddlkld/
If I'm understanding you correctly, you're currently adding each line of your file as a separate document. To add the whole file, you could just concatenate the lines, and add them all at once.
tdm = textmining.TermDocumentMatrix()
# add_doc(reading_file_info) failed because add_doc expects a string, not a list
tdm.add_doc(' '.join(reading_file_info))
If you are looking for multiple matrices, you'll end up getting only one row in each, as there is only one document per matrix, unless you have another way of splitting the lines into separate documents. You may want to rethink whether this is what you actually want. Nevertheless, I think this code will do it for you:
with open("txt_files/input_data_set.txt") as f:
tdms = []
for line in f:
tdm = textmining.TermDocumentMatrix()
tdm.add_doc(line.strip())
tdms.append(tdm)
for tdm in tdms:
for row in tdm.rows(cutoff=1):
print row
I haven't really been able to test this code, so the output might not be right. Hopefully it will get you on your way.
@Fred Thanks for the reply. I want to show it as in the image file. Actually, I was able to produce the same result using the following code, but I want each line as a separate matrix, not one matrix.
with open("txt_files/input_data_set.txt") as f:
reading_file_info = f.read()#reading lines exact content
reading_file_info=f.read
tdm = textmining.TermDocumentMatrix()
tdm.add_doc(reading_file_info)
tdm.write_csv('txt_files/input_data_set_result.txt', cutoff=1)
for row in tdm.rows(cutoff=1):
print row
What I'm trying to do is read a text file and create a term-document matrix.