How to access and manipulate individual elements in a csv file? - python

I'm trying to do some pre-processing of some data in a csv file. The file contains information on various ramen noodles. The 3rd element of each row in the file contains a string of anywhere from 1 or 2 up to 10 words. These words describe the ramen (for example: "Spicy Noodle Chili Garlic Korean", "Cup Noodles Chicken", etc.).
There are over 2,500 reviews and I'm trying to keep track of the 100 most-used words for the descriptions across all the ramens. I then go back through my data, keeping only the words that occur in the 100 most-used. I discard the rest.
For reference, my header looks like this:
Review #,Brand,Variety,Style,Country,Stars,Top Ten
I'm not quite sure how to access the individual words within each description. By description, I'm referring to the 'variety' column.
As a way to test, I have something like:
import csv

reader = csv.reader(open('ramen-ratings.csv', 'r'))
outputfile = open('variety.txt', 'w')
next(reader)
for line in reader:
    for word in line[2]:
        print(word)
But this only prints each individual character, one at a time, each on its own line. It's not recognizing the individual words within the string, but instead the individual characters.
Pretty basic question I know, but I'm super new to python so could use some help. Thanks!

Instead of
for word in line[2]:
use
for word in line[2].split():
The explanation:
line[2] is — as you wrote — the string of words. By iterating over the string you iterate over its individual characters.
The .split() method on the other hand returns the list of individual words of that string (which is what you want).

Since line[2] is a string, iterating over it means iterating over each character. If you want to iterate over each word, you should split the string into words.
You can use the split method for this purpose; by default it splits a string into a list of words on whitespace (unless you provide another separator to split by):
for line in reader:
    for word in line[2].split():
        print(word)
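For the larger goal in the question (keeping only the 100 most-used words across all descriptions), here is a minimal sketch using collections.Counter. The file name and column index are taken from the question; the rest is one possible approach, not the only one:
import csv
from collections import Counter

# Count every word in the Variety column (index 2), then keep the top 100
with open('ramen-ratings.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    counts = Counter(word for row in reader for word in row[2].split())

top_100 = {word for word, _ in counts.most_common(100)}

# Filter a description down to only the most-used words
description = "Spicy Noodle Chili Garlic Korean"
kept = [w for w in description.split() if w in top_100]
print(kept)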

Related

trying to turn a list of strings into a list of a list of strings

I have a very long .txt file of bbcoded data. I've split each sampling of data into a separate item in a list:
import re
file = open('scratch.txt', 'r')
file = file.read()
# split each dial into a separate entry in list
alldials = file.split('\n\n')
adials = []
for dial in alldials:
    re.split('b|d|c', dial)
    adials.append(dial)

print(adials[1])
print(adials[1][8])
So that prints a string of data and the 9th character in the string. But the string is not split by the letters used in the argument, or really split at all, unless the print command specifically asks for that second index...
What I'd like to split it by are these strings: '\s\s[b]', '[\b]', [dial], [\dial], [icon], and [\icon]. But as I started running into problems, I simplified the code down more and more to figure out what was going wrong, and now it's as simple as I can make it. I guess I'm misunderstanding a fundamental part of split() or the re module.
The problem is that re.split does not modify the string in place; it returns the result as a new list. That means if you want to split the string you should do something like this:
split_dial = re.split('b|d|c', dial)
adials.append(split_dial)
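For illustration, a hedged sketch of splitting on the literal bbcode-style tags from the question; the sample string is hypothetical, and it assumes the '[\b]'-style tags in the question mean closing tags like [/b]. re.escape keeps the brackets from being treated as regex metacharacters:
import re

# Hypothetical sample in the bbcode style described in the question
dial = "[dial][icon]smile[/icon]Hello there[/dial]"

# Escape each literal tag so [ and ] are not read as a character class
tags = ["[b]", "[/b]", "[dial]", "[/dial]", "[icon]", "[/icon]"]
pattern = "|".join(map(re.escape, tags))

# re.split returns a new list; drop the empty strings between adjacent tags
parts = [p for p in re.split(pattern, dial) if p]
print(parts)  # ['smile', 'Hello there']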

Filter special characters, count them, and rewrite on another csv

I am trying to see if there are special characters in a csv file. This file consists of one column with about 180,000 rows. Since my file contains Korean, English, and Chinese, I added 가-힣, A-Z, and 0-9, but I do not know what I should add so that Chinese letters are not filtered out. Or is there any better way to do this?
Special letters I am looking for are: ■, △, ?, etc.
Special letters I do not want to count are: units (e.g. ㎍, ㎥, ℃), (), ', etc.
Searching on Stack Overflow, many questions assume you first designate the special letters to look for. But in my case that is difficult, since I have 180,000 records and I do not know what letters are actually in there. As far as I am concerned, there are only three languages: Korean, English, and Chinese.
This is my code so far:
with open("C:/count1.csv",'w',encoding='cp949',newline='') as testfile:
csv_writer=csv.writer(testfile)
with open(file,'r') as fi:
for line in fi:
x=not('가-힣','A-Z','0-9')
if x in line :
sub=re.sub(x,'*',line.rstrip())
count=len(sub)
lst=[fi]+[count]
csv_writer.writerow(lst)
Using import re:
import re

regex=not'[가-힣]','[a-z]','[0-9]'
file="C:/kd/fields.csv"
with open("C:/specialcharacter.csv",'w',encoding='cp949',newline='') as testfile:
    csv_writer=csv.writer(testfile)
    with open(file,'r') as fi:
        for line in fi:
            search_target = line
            result=re.findall(regex,search_target)
            print("\n".join(result))
I do not know why you are concerned about not filtering Chinese characters when you are only looking for some special letters. This library can filter Chinese.
Filter Chinese on top of your filtered list of Korean, English, and numbers:
regex = "[^가-힣a-zA-Z0-9]"
result = re.findall(regex, search_target)
Filter either 1) a list of the special characters that you seek or 2) a list of the special characters you want to avoid.
Choose wisely which fits your case better, to avoid as many exceptions as possible, so that you do not have to add more filters every time.
Build the list as a regex.
Then loop through your 180,000 rows, using the regex to filter them.
Update your regex list until you have filtered everything.
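Building on the regex above, a sketch that also allows a common Chinese range so Chinese is not counted as special; the CJK range \u4e00-\u9fff is an assumption and may not cover every character in your data:
import re

# Keep Korean syllables, ASCII letters, digits, a common CJK range, and
# whitespace; everything else counts as a special character.
pattern = re.compile(r"[^가-힣a-zA-Z0-9\u4e00-\u9fff\s]")

line = "라면 ramen 拉麵 ■ 123 △"
specials = pattern.findall(line)
print(specials)  # ['■', '△']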

split() not splitting all white spaces?

I am trying to take a text document and write each word separately into another text document. My only issue is that with the code I have, the words aren't always all split on whitespace, and I'm wondering if I'm just using .split wrong? If so, could you explain why, or what to do better?
Here's my code:
list_of_words = []
with open('ExampleText.txt', 'r') as ExampleText:
    for line in ExampleText:
        for word in line.split(''):
            list_of_words.append(word)
    print("Done!")
print("Also done!")
with open('TextTXT.txt', 'w') as EmptyTXTdoc:
    for word in list_of_words:
        EmptyTXTdoc.write("%s\n" % word)
EmptyTXTdoc.close()
This is the first line in the ExampleText text document as it is written in the newly created EmptyTXTdoc:
Submit
a personal
statement
of
research
and/or
academic
and/or
career
plans.
Use .split() (or .split(' ') for only spaces) instead of .split('').
Also, consider sanitizing the line with .strip() on every iteration over the file, since each line comes with a newline (\n) at its end.
.split('') will not split on a space because there isn't a space between the two apostrophes. You're telling it to split on, well, nothing.
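A quick illustration of the difference (the sample string is made up):
text = "Submit  a personal\tstatement \n"

# .split() with no argument splits on runs of any whitespace and drops empties
print(text.split())      # ['Submit', 'a', 'personal', 'statement']

# .split(' ') splits on single spaces only, keeping empty strings and tabs
print(text.split(' '))   # ['Submit', '', 'a', 'personal\tstatement', '\n']

# .strip() removes the surrounding whitespace before splitting
print(text.strip().split(' '))  # ['Submit', '', 'a', 'personal\tstatement']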

Nested for loop in python does not increment the outer for loop

I have two files: q.txt contains words and p.txt contains sentences. I need to check if any of the words in q.txt is present in p.txt. Following is what I wrote:
#!/usr/bin/python
twts=open('p.txt','r');
words=open('q.txt','r');
for wrd in words:
    for iter in twts:
        if (wrd in iter):
            print "Found at line" + iter
It does not print the output even when there is a match. Also, I could see that the outer for loop does not proceed to the next value in the words object. Could someone please explain what I am doing wrong here?
Edit 1: I'm using Python 2.7
Edit 2: Sorry I've mixed up the variable names. Have corrected it now.
When you iterate over a file object, after completing the iteration the cursor ends up at the end of the file. So trying to iterate over it again (in the next iteration of the outer for loop) would not work. The easiest way for your code to work would be to seek to the start of the file at the beginning of each pass of the outer for loop. Example -
#!/usr/bin/python
words=open('q.txt','r');
twts=open('p.txt','r');
for wrd in words:
    twts.seek(0)
    for twt in twts:
        if (wrd.strip() in twt):
            print "Found at line " + twt
Also, according to the question, it seems like you are using the wrong files: twts should be the file with sentences, and words the file with words. But you have opened p.txt for words and q.txt for sentences. If it's the opposite, you should open the files the other way round.
Also, I would advise against using iter as a variable name, as that is also the name of a built-in function, and defining it in for iter in twts shadows the built-in iter().
It would be better if you had posted the content of the files, but have you stripped the \n from the lines? This works for me:
words = open('words.txt', 'r')
twts = open('sentences.txt', 'r')
for w in words:
    for t in twts:
        if w.rstrip('\n') in t.rstrip('\n'):
            print w, t
It seems that you mixed up the 2 files. You say that q.txt contains the words, but you stored p.txt into the words variable.
When you iterate over twts, once you have exhausted the iterator the pointer is at the end of the file, so there is nothing to iterate over after the first pass of the outer loop. You can seek repeatedly, but if words is not a huge file you can make a set of all the words, so you only iterate over the sentences file once, for an O(n*k) running time as opposed to a quadratic solution that reads every single line for every word in your words file. Splitting will also match exact words, not substrings:
from string import punctuation
with open('p.txt','r') as twts, open('q.txt','r') as words:
    st = set(map(str.rstrip, words))
    for line in twts:
        if any(word.rstrip(punctuation) in st for word in line.split()):
            print("Found at line {}".format(line))

Building a Markov model from a text file?

I have an assignment to build a program that, based off an input file, reads text and then generates new text. The dictionary should map n string of letters to a list of letters that could follow the string, based off the text in the input file. Thus far, I have
def create_dic():
    n = order_entry.get()
    inputfile = file_entry.get() #name of input file
    lines = open(inputfile,'r').read() #reads input file into string
    model = {} #empty dictionary to build Markov model
For every n-character sequence in the input, I have to "look it up in the dictionary to get a list of possible succeeding characters and get the next character." I'm confused about the instruction to look up the string in the dictionary when the dictionary is empty to begin with. Won't there be nothing in the dictionary?
Since this is an assignment, I will give you leading questions rather than an answer. As #Quirliom said, "Populate the dictionary."
When you want to use the Markov model, what key would you like to search the dictionary for?
When you search for that key, what would you like to get back?
The sentence, "The dictionary should map n string of letters to a list of letters that could follow the string, based off the text in the input file," has the answers to those questions. This means that you will have to do some work on the input file to figure out how to extract the dictionary keys and what they should map to.
This is definitely not the best approach, but you could start with this.
On a letter basis: find which letter appears in the first position most often (for the entire data).
The first characters (letters) of the words are countable entities, and it is rational to check which character (letter) has the most occurrences. Start your generated text with that letter, then look at which letter most often succeeds it, and so on. Also take the average word length and distribute the generated words around this length.
For better results:
On an n-gram basis: find which n-gram is most likely to precede others (you can also extend this to sentences).
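A hedged sketch of the letter-frequency idea above; the word list is hypothetical:
from collections import Counter

# Hypothetical word list standing in for the input data
words = ["ramen", "noodle", "rich", "nice", "recipe"]

# Which letter most often starts a word
first_letters = Counter(w[0] for w in words)
print(first_letters.most_common(1))  # [('r', 3)]

# Average word length, to distribute generated word lengths around
avg_len = sum(len(w) for w in words) / len(words)
print(avg_len)  # 5.0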
