Extract a column from a text file into a list - python

I have a text file which contains a table comprised of numbers e.g:
5,10,6
6,20,1
7,30,4
8,40,3
9,23,1
4,13,6
If, for example, I want only the numbers in the third column, how do I extract that column into a list?
I have tried the following:
myNumbers.append(line.split(',')[2])

The strip method removes the trailing newline character from each line, and the split method breaks the line apart using the comma as the delimiter.
line.strip().split(',')[2]
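Put together, the approach above can be sketched as follows. This is a minimal sketch, not the only way to do it: the helper name third_column is made up for illustration, the sample data is taken from the question, and io.StringIO stands in for the real file (any open file object works the same way). Converting with int() is an assumption that every cell is a whole number.

```python
import io

def third_column(lines, index=2, sep=","):
    """Collect one column (0-based index) from comma-separated lines."""
    values = []
    for line in lines:
        line = line.strip()          # drop the trailing newline
        if not line:                 # skip blank lines
            continue
        values.append(int(line.split(sep)[index]))
    return values

# Stand-in for open('yourfile.txt'); the data is the table from the question.
sample = io.StringIO("5,10,6\n6,20,1\n7,30,4\n8,40,3\n9,23,1\n4,13,6\n")
my_numbers = third_column(sample)
print(my_numbers)  # [6, 1, 4, 3, 1, 6]
```

With a real file you would pass the file object instead, for example inside a with open(...) as f: block.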

Related

Python Reading Variable Whitespace Text Table Format

I have this weird output from another tool, which I cannot change or modify, that I need to parse and analyze. Any ideas on which pandas feature or Python library I should use? The columns are padded with spaces so that each column start is aligned, which makes it difficult to parse. White space and tabs are not the delimiter.
If the columns are consistent and every cell has a value, it should actually be pretty easy to parse this manually. You can do some variation of:
your_file = 'C:\\whatever.txt'
with open(your_file) as f:
    for line in f:
        total_capacity, existing_load, recallable_load, new_load, excess_load, excess_capacity_needed = line.strip().split()
        # analysis
If you have strings with spaces, you can alternatively do:
for line in f:
    row = (cell.strip() for cell in line.split('  ') if cell.strip())
    total_capacity, existing_load, recallable_load, new_load, excess_load, excess_capacity_needed = row
    # analysis
This splits the line on every run of two spaces and uses a generator expression to strip excess whitespace, tabs, and newlines from each value, dropping any values that end up empty. If you have strings that can contain multiple consecutive spaces, that would be an issue. You'd need a way to ensure those strings were enclosed in quotes so you could isolate them.
If by "white space and tabs are not the delimiter" you meant that the whitespace is some unusual blank Unicode character, you can just do
row = (cell.strip() for cell in line.split(unicode_character_here) if cell.strip())
Just make sure that for any of these solutions, you remember to add some checks for those ===== dividers and add a way to detect the titles of the columns if necessary.
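As a rough illustration of the checks mentioned above, here is a minimal sketch that skips ===== divider lines and treats the first remaining row as the column titles. The sample table is hypothetical (the original tool output is not shown), and the sketch assumes columns are separated by at least two spaces while values contain at most single spaces.

```python
import re

def parse_table(lines):
    """Split space-aligned rows into cells, skipping blanks and dividers."""
    rows = []
    for line in lines:
        stripped = line.strip()
        # Skip empty lines and divider lines made only of '=' and spaces.
        if not stripped or set(stripped) <= {"=", " "}:
            continue
        # Two or more whitespace characters mark a column boundary.
        rows.append(re.split(r"\s{2,}", stripped))
    return rows

# Hypothetical sample; the real file's columns may differ.
sample = [
    "Total Capacity   Existing Load   New Load",
    "==============   =============   ========",
    "100              40              15",
    "",
]
rows = parse_table(sample)
header, data = rows[0], rows[1:]
print(header)  # ['Total Capacity', 'Existing Load', 'New Load']
print(data)    # [['100', '40', '15']]
```

If some cells can be empty, splitting on whitespace will misalign the columns, and a fixed-width reader based on character positions would be a safer choice.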

Reading text file into pandas which contains escape characters results in missing data

I have a long list of regex patterns that I want to replace in some text data, so I created a text file with all the regex patterns and the replacements I want for them. However, when I read this text file as a pandas dataframe, one particular regex pattern is missing from the dataframe. I tried to read it as a plain text file instead and build a pandas dataframe out of it, but since some of the regex patterns contain "," in them, it becomes difficult to separate each line into 2 different columns.
The keys (regex patterns) and values (the replacements I want for them) look as follows:
key,value
"\bnan\b|\bnone\b"," "
"^\."," "
"\s+"," "
"\.+","."
"-+","-"
"\(\)"," "
when I read this as a dataframe using the following lines of code,
normalization_mapper = pd.read_csv(f"{os.getcwd()}/normalize_dict.txt", sep=",")
I also tried the datatype argument as follows
dtype={'key':'string', 'value':'string'}
What I can't figure out is how to handle this missing "\(\)" pattern in the dataframe. I'm pretty sure it's related to how the pandas data reader handles escape characters. Is there a way to handle this? Also, weirdly, escape characters in other rows cause no issues when reading the text file.

How to access and manipulate individual elements in a csv file?

I'm trying to do some pre-processing of some data in a csv file. The file contains information on various ramen noodles. The 3rd element of each row in the file contains a string of anywhere from 1 or 2 up to 10 words. These words describe the ramen (for example: "Spicy Noodle Chili Garlic Korean", or "Cup Noodles Chicken", etc).
There are over 2,500 reviews and I'm trying to keep track of the 100 most-used words for the descriptions across all the ramens. I then go back through my data, keeping only the words that occur in the 100 most-used. I discard the rest.
For reference, my header looks like this:
Review #,Brand,Variety,Style,Country,Stars,Top Ten
I'm not quite sure how to access the individual words within each description. By description, I'm referring to the 'variety' column.
As a way to test, I have something like:
reader = csv.reader(open('ramen-ratings.csv', 'r'))
outputfile = open('variety.txt', 'w')
next(reader)
for line in reader:
    for word in line[2]:
        print(word)
But this only prints each individual character, one at a time, on their own line. It's not recognizing the individual words within the string, but instead the individual characters.
Pretty basic question I know, but I'm super new to python so could use some help. Thanks!
Instead of
for word in line[2]:
use
for word in line[2].split():
The explanation:
line[2] is, as you wrote, the string of words. By iterating over the string you iterate over its individual characters.
The .split() method on the other hand returns the list of individual words of that string (which is what you want).
Since line[2] is a string, iterating over it means iterating over each character. If you want to iterate over each word, you should split the string into words.
You can use the split method for this purpose, which by default splits a string on whitespace into a list of strings (unless you provide another separator to split by):
for line in reader:
    for word in line[2].split():
        print(word)
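Building on split(), the larger goal from the question (count the most-used words, then keep only those words) could be sketched with collections.Counter. This is only an illustration: the three sample rows are invented, io.StringIO stands in for ramen-ratings.csv, and top_n is set to 3 here where the question wants 100.

```python
import csv
import io
from collections import Counter

# Hypothetical rows using the header from the question.
sample = io.StringIO(
    "Review #,Brand,Variety,Style,Country,Stars,Top Ten\n"
    "1,BrandA,Spicy Noodle Chili Garlic Korean,Pack,South Korea,4,\n"
    "2,BrandB,Cup Noodles Chicken,Cup,USA,3,\n"
    "3,BrandC,Spicy Chicken Noodle,Bowl,Japan,5,\n"
)

reader = csv.reader(sample)
next(reader)                                  # skip the header row
varieties = [row[2] for row in reader]        # the 'Variety' column

# First pass: count every word across all descriptions.
counts = Counter()
for variety in varieties:
    counts.update(variety.split())

top_n = 3                                     # the question uses 100
top_words = {word for word, _ in counts.most_common(top_n)}

# Second pass: keep only the words that are among the most used.
filtered = [[w for w in v.split() if w in top_words] for v in varieties]
print(filtered)
# [['Spicy', 'Noodle'], ['Chicken'], ['Spicy', 'Chicken', 'Noodle']]
```

Note that this counts words case-sensitively, so "Noodle" and "Noodles" (or "noodle") are treated as different words.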

replacing a character in a line of string

I have a .txt file with 20 lines. Each line contains 10 zeroes separated by commas.
0,0,0,0,0,0,0,0,0,0
I want to replace every fifth 0 with 1 in each line. I tried .replace function but I know there must be some easy way in python
You can split the text string with the command below.
text_string = "0,0,0,0,0,0,0,0,0,0"
string_list = text_string.split(",")
Then you can replace every fifth element in the list string_list by assigning to its index. (Using insert here would add new elements and lengthen the line instead of replacing the zeroes.)
for i in range(4, len(string_list), 5):
    string_list[i] = "1"
After this, join the elements of the list using the join method:
output = ",".join(string_list)
The output for this will be:
'0,0,0,0,1,0,0,0,0,1'
This is one way of doing it.
If the text in this file follows a regular pattern, you can parse it as a CSV file, change every fifth element of each row, and write the result to a new file.
But if you want to modify the existing text file in place, for example by replacing a character, then you can use seek; refer to "How to modify a text file?".
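The CSV route could look roughly like the sketch below, which reads each row, overwrites every fifth element by index, and writes the row to a separate output. The function name replace_every_fifth is hypothetical, and io.StringIO objects stand in for the input and output files so the example runs on its own.

```python
import csv
import io

def replace_every_fifth(src, dst, step=5, new_value="1"):
    """Copy CSV rows from src to dst, overwriting every step-th cell."""
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        # Indices 4, 9, 14, ... are the 5th, 10th, 15th, ... elements.
        for i in range(step - 1, len(row), step):
            row[i] = new_value
        writer.writerow(row)

# Stand-ins for the real input and output files (two lines instead of 20).
src = io.StringIO("0,0,0,0,0,0,0,0,0,0\n" * 2)
dst = io.StringIO()
replace_every_fifth(src, dst)
result = dst.getvalue()
print(result)
# 0,0,0,0,1,0,0,0,0,1
# 0,0,0,0,1,0,0,0,0,1
```

With real files, open the input and output with newline="" as the csv module documentation recommends, and write to a new file before replacing the original.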

trying to turn a list of strings into a list of a list of strings

I have a very long .txt file of bbcoded data. I've split each sampling of data into a separate item in a list:
import re
file = open('scratch.txt', 'r')
file = file.read()
# split each dial into a separate entry in list
alldials = file.split('\n\n')
adials = []
for dial in alldials:
    re.split('b|d|c', dial)
    adials.append(dial)
print(adials[1])
print(adials[1][8])
So that prints a string of data and the 9th character in the string. But the string is not split by the letters used in the argument, or really split at all unless the print command specifically asks for that second index.
What I'd like to split it by are these strings: '\s\s[b]', '[\b]', [dial], [\dial], [icon], and [\icon], but as I started running into problems, I simplified the code down more and more to figure out what was going wrong. Now it's as simple as I can make it, and I guess I'm misunderstanding a fundamental part of split() or the re module.
The problem is that re.split does not modify the string in place; it returns a new list of the split pieces. That means if you want to keep the split result, you should do something like this:
split_dial = re.split('b|d|c', dial)
adials.append(split_dial)
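For the tag-based split the question is ultimately after, one possible sketch is shown below. It rests on an assumption: that closing tags are written with a forward slash, as in [/b], [/dial], and [/icon]; if the data really uses a backslash, the pattern has to be adjusted. The square brackets are escaped because they are special characters in a regular expression. The sample text and the name split_dials are made up for illustration.

```python
import re

# Assumption: tags look like [b], [/b], [dial], [/dial], [icon], [/icon].
# Brackets are escaped so they match literally instead of forming a character class.
TAG_PATTERN = re.compile(r"\[/?(?:b|dial|icon)\]")

def split_dials(text):
    """Split text into dials on blank lines, then split each dial on tags."""
    dials = []
    for dial in text.split("\n\n"):
        parts = [part.strip() for part in TAG_PATTERN.split(dial)]
        # re.split leaves empty strings between adjacent tags; drop them.
        dials.append([part for part in parts if part])
    return dials

# Hypothetical stand-in for the contents of scratch.txt.
sample = "[dial]first[/dial][icon]a.png[/icon]\n\n[b]second[/b] [dial]x[/dial]"
adials = split_dials(sample)
print(adials[1])     # ['second', 'x']
print(adials[1][0])  # second
```

Each entry in adials is now a list of strings, so adials[1][0] is a whole piece of text instead of a single character.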
