Cleaning text file data after reading line by line without using pandas

Let's say I have text file data like this:
|-------|
|Arsenal|
|-------|
|2021
|-------|
|Rnd|A|W|D|L|Venu|Date|
|R1|Tottenham|1|0|0|Emirates|March|
|R2|Man utd|0|1|0|Old Trafford|March|
|Total|Average|1234|5678|
|Arsenal|
|-------|
|2020|
|-------|
|Rnd|A|W|D|L|Venu|Date|
|R1|Chelsea|1|0|0|Stamford Bridge|March|
|R2|Mancity|0|1|0|Ethiad|March|
|Total|Average|1234|5678|
I want to convert this file into a 2D array (list of lists) without using pandas, and I am hoping for output like this:
Arsenal 2021 R1 Tottenham 1 0 0 Emirates March
Arsenal 2021 R2 Man utd 0 1 0 Old Trafford March
Arsenal 2020 R1 Chelsea 1 0 0 Stamford Bridge March
Arsenal 2020 R2 Man city 0 1 0 Ethiad March
So I need to ignore the |-------| separators, the |Rnd|...| header row and the |Total|Average|1234|5678| row, and I need to attach Arsenal and 2021 to every row of the first block, and Arsenal and 2020 to every row of the next year's block.
I used a for loop over the file, line by line, and built the list of lists. But I couldn't remove the header row (Rnd, A, W, D, L, Venu, Date) or the Total/Average row while going through the file line by line without pandas.

You can use a flag variable such as first_part = True/False to run different code inside the loop.
You can also use next(file) to read the next line(s) from the file, so in the first part you can read several lines to get the word and the year, then set first_part = False. In the second part you only have to add this word and year to each line, and check whether the line starts with |Total| to set first_part = True again.
Minimal working example:
I use io to simulate a file, but you should use open().
text = '''|Arsenal|
|-------|
|2021
|-------|
|Rnd|A|W|D|L|Venu|Date|
|R1|Tottenham|1|0|0|Emirates|March|
|R2|Man utd|0|1|0|Old Trafford|March|
|Total|Average|1234|5678|
|Arsenal|
|-------|
|2020|
|-------|
|Rnd|A|W|D|L|Venu|Date|
|R1|Chelsea|1|0|0|Stamford Bridge|March|
|R2|Mancity|0|1|0|Ethiad|March|
|Total|Average|1234|5678|'''
import io
#fh = open('data.csv')
fh = io.StringIO(text)
first_part = True
for line in fh:
    if first_part:
        word = line.rstrip('\n').rstrip('|')
        line = next(fh)
        line = next(fh)
        year = line.rstrip('\n').rstrip('|')
        line = next(fh)
        line = next(fh)
        first_part = False
    else:
        if line.startswith('|Total|'):
            first_part = True
        else:
            new_line = word + year + line
            print(new_line, end='')
Result:
|Arsenal|2021|R1|Tottenham|1|0|0|Emirates|March|
|Arsenal|2021|R2|Man utd|0|1|0|Old Trafford|March|
|Arsenal|2020|R1|Chelsea|1|0|0|Stamford Bridge|March|
|Arsenal|2020|R2|Mancity|0|1|0|Ethiad|March|
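Since the question asks for a list of lists rather than printed lines, the same idea can also be sketched by classifying each line instead of counting next() calls. This is only a sketch: the function name parse_blocks is hypothetical, and it assumes that a single-field line made of digits is a year and any other single-field line is a team name.

```python
import io

def parse_blocks(lines):
    """Return rows as lists, each prefixed with the current team and year."""
    rows = []
    team = year = None
    for line in lines:
        # drop the newline and the outer pipes, then split into fields
        fields = line.strip().strip('|').split('|')
        first = fields[0]
        if not first or set(first) == {'-'}:
            continue                      # blank line or |-------| separator
        if first in ('Rnd', 'Total'):
            continue                      # column header or summary row
        if len(fields) == 1:
            # assumption: all digits means a year, anything else a team name
            if first.isdigit():
                year = first
            else:
                team = first
            continue
        rows.append([team, year] + fields)
    return rows

sample = io.StringIO(
    "|-------|\n|Arsenal|\n|-------|\n|2021\n|-------|\n"
    "|Rnd|A|W|D|L|Venu|Date|\n"
    "|R1|Tottenham|1|0|0|Emirates|March|\n"
    "|Total|Average|1234|5678|\n"
)
rows = parse_blocks(sample)
print(rows)
```

Because each line is classified on its own, this variant does not depend on the exact number of separator lines between the team, the year and the header.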

with open('Arsenal.txt', 'r') as f:
    for line in f:
        if not line.startswith(('| --- |', '| Rnd |','| Totals |','| Averages |')) :
            line= line.strip()
            field= line.split('|')
            print(field)
#furas this is my code I tried

Related

Reading CSV files and pulling specific data

Here is some sample data
Game  Date                     HomeTeam              FT   HT   AwayTeam
1     (Fri) 10 Aug 2018 (W32)  Manchester United FC  2-1  1-0  Leicester City FC
2     (Sat) 11 Aug 2018 (W32)  AFC Bournemouth       2-0  1-0  Cardiff City FC
3     (Sat) 11 Aug 2018 (W32)  Fulham FC             0-2  0-1  Crystal Palace FC
Based on user input, provide the total number of goals scored by a specific team throughout the season.
Ask the user for a game number and provide the names of both teams and the score for that game.
This is what I have so far (note that I'm not allowed to use pandas):
def t_goals():
    f = open("EPL_18-19_HW2.txt")
    next(f)
    total_goals = 0
    for lines in f:
        game = lines.strip().split(',')
        goals = game[3].split("-")
        for num in goals:
            total_goals += int(num)
    f.close()
    return total_goals

If you are not using pandas, use the built-in csv module (https://docs.python.org/3/library/csv).
Read your file like this:
import csv

def t_goals():
    with open('your_csv_file.csv', 'r', newline='') as csv_file:
        # create a reader for your file
        input_reader = csv.reader(csv_file, delimiter=',')
        # skip the first line which has the column names
        next(input_reader)
        total_goals = 0
        # all the lines can be read by iteration
        for line in input_reader:
            # the comma-separated values of each line are read as a list,
            # so read the FT column with line[3]
            goals = line[3].split("-")
            for num in goals:
                total_goals += int(num)
    # if you open a file using 'with', you don't have to explicitly write the close statement
    return total_goals

Here are some quick functions I wrote that I think achieve what you want.
For the first problem, check whether the team you passed to the function is the home or the away team, then take the corresponding half of the score and add it to total_goals:
def get_total_goals(target_team='Manchester United FC'):
    total_goals = 0
    with open('sample.csv', 'r') as f:
        next(f)
        for line in f:
            current_home_team = line.strip().split(',')[2].strip()
            current_away_team = line.strip().split(',')[5].strip()
            if current_home_team == target_team:
                score = int(line.strip().split(',')[3].split('-')[0])
                total_goals += score
            elif current_away_team == target_team:
                score = int(line.strip().split(',')[3].split('-')[1])
                total_goals += score
    return total_goals
For the next problem, iterate through the rows and check whether the game number equals the game number you passed into the function. If there's a match, the required details are returned in a dictionary; if not, "Game not found" is returned.
def get_game_details(game_number=1):
    with open('sample.csv', 'r') as f:
        next(f)
        for line in f:
            if int(line.strip().split(',')[0]) == game_number:
                return {
                    'HomeTeam':line.strip().split(',')[2],
                    'AwayTeam':line.strip().split(',')[5],
                    'FT':line.strip().split(',')[3],
                    'HT':line.strip().split(',')[4]
                }
    return "Game not found"
These should give you a starting point; you can make changes as required for your use case. You can also use the csv module included with Python, as mentioned by anotherGatsby in their answer.
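The two approaches can be combined: csv.DictReader lets the columns be addressed by name instead of by position. The following is a minimal sketch, assuming the file is comma-separated and starts with the header row Game,Date,HomeTeam,FT,HT,AwayTeam from the sample data; the function name total_goals_for is hypothetical, and io.StringIO stands in for a real file.

```python
import csv
import io

def total_goals_for(team, csv_file):
    """Sum the full-time goals scored by `team`, whether home or away."""
    total = 0
    for row in csv.DictReader(csv_file):
        # FT looks like "2-1": home goals first, away goals second
        home_goals, away_goals = (int(n) for n in row['FT'].split('-'))
        if row['HomeTeam'].strip() == team:
            total += home_goals
        elif row['AwayTeam'].strip() == team:
            total += away_goals
    return total

sample = io.StringIO(
    "Game,Date,HomeTeam,FT,HT,AwayTeam\n"
    "1,(Fri) 10 Aug 2018 (W32),Manchester United FC,2-1,1-0,Leicester City FC\n"
    "2,(Sat) 11 Aug 2018 (W32),AFC Bournemouth,2-0,1-0,Cardiff City FC\n"
    "3,(Sat) 11 Aug 2018 (W32),Fulham FC,0-2,0-1,Crystal Palace FC\n"
)
print(total_goals_for('Crystal Palace FC', sample))
```

With a real file, the second argument would be the object returned by open(path, newline='').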

How to parse a text file into a dictionary in python with key on one line followed by two lines of values

I have a file with lines in this format:
CALSPHERE 1
1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996
2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319
CALSPHERE 2
1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990
2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421
..etc.
I would like to parse this into a dictionary of the format:
{CALSPHERE 1:(1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996, 2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319),
CALSPHERE 2:(1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990, 2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421),...}
I'm puzzled as to how to parse this, so that every third line is the key, with the following two lines forming a tuple for the value. What would be the best way to do this in python?
I've attempted to add some logic for "every third line" though it seems kind of convoluted; something like
with open(r"file") as f:
    i = 3
    for line in f:
        if i % 3 == 0:
            key = line
        else:
            # not sure what to do with the next lines here
            pass

If your file always has the same layout (i.e. a name line such as 'CALSPHERE 1', or whatever you want as your dictionary key, followed by two lines), you can achieve what you want with something like the following:
with open(filename) as file:
    lines = file.read().splitlines()
d = dict()
for i in range(0, len(lines), 3):
    d[lines[i].strip()] = (lines[i + 1], lines[i + 2])
Output:
{
'CALSPHERE 1': ('1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996', '2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319'),
'CALSPHERE 2': ('1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990', '2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421')
}
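The same fixed three-line layout can also be consumed without indexing, by passing one iterator to zip three times so that each step pulls three consecutive lines. This is a sketch under the same assumption that every record is exactly three lines; the name parse_triples is hypothetical, and io.StringIO stands in for the real file.

```python
import io

def parse_triples(lines):
    """Map each name line to a tuple of the two lines that follow it."""
    stripped = (line.rstrip('\n') for line in lines)
    # the same generator is passed three times, so zip takes
    # three consecutive lines from it on every step
    return {name.strip(): (first, second)
            for name, first, second in zip(stripped, stripped, stripped)}

sample = io.StringIO(
    "CALSPHERE 1\n"
    "1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996\n"
    "2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319\n"
    "CALSPHERE 2\n"
    "1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990\n"
    "2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421\n"
)
result = parse_triples(sample)
print(result['CALSPHERE 2'])
```

Note that zip stops at the shortest input, so an incomplete record at the end of the file is silently dropped rather than raising an IndexError.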

Assuming that your content is in file.txt, you can use the following.
It should work for any number of CALSPHERE entries and for a varying number of lines between them.
with open('file.txt') as inp:
    buffer = []
    for line in inp:
        # remove newline
        copy = line.replace('\n','')
        # check if next entry
        if 'CALSPHERE' in copy:
            buffer.append([])
        # add line
        buffer[-1].append(copy)
# put the output into dictionary
res = {}
for chunk in buffer:
    # safety check
    if len(chunk) > 1:
        res[chunk[0]] = tuple( chunk[1:] )
print(res)

Appending the length of sentences to file

I found the length and the index of each sentence, and I want to save all of them to a new file.
Example: index sentence length
My code:
file = open("testing_for_tools.txt", "r")
lines_ = file.readlines()
for line in lines_:
    length=len(line)-1
    print(length)
for item in lines_:
    print(lines_.index(item)+1,item)
output:
64
18
31
31
23
36
21
9
1
1 i went to city center, and i bought xbox5 , and some other stuff
2 i will go to gym !
3 tomorrow i, sill start my diet!
4 i achive some and i need more ?
5 i lost lots of weights؟
6 i have to , g,o home,, then sleep ؟
7 i have things to do )
8 i hope so
9 o
Desired output, saved to a new file:
1 i went to city center, and i bought xbox5 , and some other stuff 64
2 i will go to gym ! 18

This can be achieved using the following code. Note the use of with ... as f which means we don't have to worry about closing the file after using it. In addition, I've used f-strings (requires Python 3.6), and enumerate to get the line number and concatenate everything into one string, which is written to the output file.
with open("test.txt", "r") as f:
    lines_ = f.readlines()
with open("out.txt", "w") as f:
    for i, line in enumerate(lines_, start=1):
        line = line.strip()
        f.write(f"{i} {line} {len(line)}\n")
Output:
1 i went to city center, and i bought xbox5 , and some other stuff 64
2 i will go to gym ! 18
If you wanted to sort the lines based on length, you could just put the following line after the first with block:
lines_.sort(key=len)
This would then give output:
1 i will go to gym ! 18
2 i went to city center, and i bought xbox5 , and some other stuff 64
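Sorting lines_ before the enumerate loop renumbers the sentences in sorted order. If each sentence should keep its original line number, one option is to number first and sort afterwards. A small sketch, where number_lines is a hypothetical helper and the list stands in for the result of readlines():

```python
def number_lines(lines, by_length=False):
    """Return 'index sentence length' strings, numbered in file order."""
    records = [(i, line.strip()) for i, line in enumerate(lines, start=1)]
    if by_length:
        # sort after numbering so each sentence keeps its original index
        records.sort(key=lambda rec: len(rec[1]))
    return [f"{i} {text} {len(text)}" for i, text in records]

lines = [
    "i went to city center, and i bought xbox5 , and some other stuff\n",
    "i will go to gym !\n",
]
for out in number_lines(lines, by_length=True):
    print(out)
```

The returned strings can be written to the output file with f.write(out + "\n"), as in the answer above.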

Edit last element of each line of text file

I've got a text file with some elements as such;
0,The Hitchhiker's Guide to Python,Kenneth Reitz,2/4/2012,0
1,Harry Potter,JK Rowling,1/1/2010,8137
2,The Great Gatsby,F. Scott Fitzgerald,1/2/2010,0
3,To Kill a Mockingbird,Harper Lee,1/3/2010,1828
The last element of each line indicates which user has taken out the given book. If it is 0, nobody has it.
I want code that replaces the ',0' at the end of a given line with a 4-digit number entered by the user, to show that someone has taken out the book.
I've used .replace to change a value such as 1828 into 0, but I don't know how to change the last element of a specific line from 0 to something else.
I cannot use csv due to work/education restrictions, so I have to leave the file in .txt format.
I can also only use the Python standard library, so no pandas.

You can capture txt using read_csv into a dataframe:
import pandas as pd
df = pd.read_csv("text_file.txt", header=None)
df.columns = ["Serial", "Book", "Author", "Date", "Issued"]
print(df)
df.loc[3, "Issued"] = 0
print(df)
df.to_csv('text_file.txt', header=None, index=None, sep=',', mode='w+')
This sets the Issued value of the row with index 3 (the fourth book) to 0.
   Serial                              Book               Author      Date  \
0       0  The Hitchhiker's Guide to Python        Kenneth Reitz  2/4/2012
1       1                      Harry Potter           JK Rowling  1/1/2010
2       2                  The Great Gatsby  F. Scott Fitzgerald  1/2/2010
3       3             To Kill a Mockingbird           Harper Lee  1/3/2010

   Issued
0       0
1    8137
2       0
3    1828

   Serial                              Book               Author      Date  \
0       0  The Hitchhiker's Guide to Python        Kenneth Reitz  2/4/2012
1       1                      Harry Potter           JK Rowling  1/1/2010
2       2                  The Great Gatsby  F. Scott Fitzgerald  1/2/2010
3       3             To Kill a Mockingbird           Harper Lee  1/3/2010

   Issued
0       0
1    8137
2       0
3       0
Edit after comment:
In case you only need to use python standard libraries, you can do something like this with file read:
i = 0
a = 5 # line to change with 1 being the first line in the file
b = '8371'
to_write = []
with open("text_file.txt", "r") as file:
    for line in file:
        i += 1
        if (i == a):
            print('line before')
            print(line)
            # keep everything up to the last comma and append the new value
            line = line[:line.rfind(',')] + ',' + b + '\n'
            to_write.append(line)
            print('line after edit')
            print(line)
        else:
            to_write.append(line)
print(to_write)
with open("text_file.txt", "w") as f:
    for line in to_write:
        f.write(line)
File content
0,The Hitchhiker's Guide to Python,Kenneth Reitz,2/4/2012,0
1,Harry Potter,JK Rowling,1/1/2010,8137
2,The Great Gatsby,F. Scott Fitzgerald,1/2/2010,84
3,To Kill a Mockingbird,Harper Lee,1/3/2010,7895
4,XYZ,Harper,1/3/2018,258
5,PQR,Lee,1/3/2019,16
gives this as output
line before
4,XYZ,Harper,1/3/2018,258
line after edit
4,XYZ,Harper,1/3/2018,8371
["0,The Hitchhiker's Guide to Python,Kenneth Reitz,2/4/2012,0\n", '1,Harry Potter,JK Rowling,1/1/2010,8137\n', '2,The Great Gatsby,F. Scott Fitzgerald,1/2/2010,84\n', '3,To Kill a Mockingbird,Harper Lee,1/3/2010,7895\n', '4,XYZ,Harper,1/3/2018,8371\n', '5,PQR,Lee,1/3/2019,16\n', '\n']
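The same edit can also be keyed on the book number in the first field rather than on the line's position in the file. The sketch below assumes the first field is a unique book number, as in the sample data; set_borrower is a hypothetical helper name, and the list stands in for the result of readlines().

```python
def set_borrower(lines, book_id, user_id):
    """Return a copy of `lines` with the last field of one book replaced."""
    updated = []
    for line in lines:
        fields = line.rstrip('\n').split(',')
        # match on the first field (the book number), not on line position
        if fields[0] == str(book_id):
            fields[-1] = str(user_id)
            line = ','.join(fields) + '\n'
        updated.append(line)
    return updated

books = [
    "0,The Hitchhiker's Guide to Python,Kenneth Reitz,2/4/2012,0\n",
    "1,Harry Potter,JK Rowling,1/1/2010,8137\n",
    "2,The Great Gatsby,F. Scott Fitzgerald,1/2/2010,0\n",
]
updated = set_borrower(books, 2, 4321)
print(updated[2], end='')
```

Only the first and last fields are inspected, so the middle fields are written back unchanged; the updated list can then be written to the file with a loop of f.write(line) calls as shown above.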

You can try this:
to_write = []
with open('read.txt') as f:  # read input file
    for line in f:
        # split off the last field so that only it is compared and replaced
        head, sep, last = line.rstrip('\n').rpartition(',')
        if sep and last == '0':
            to_write.append(head + ',1234\n')  # any 4 digit number
        else:
            to_write.append(line)
with open('output.txt', 'w') as f:  # name of output file
    for _list in to_write:
        f.write(_list)

Return the average mark for all student in that Section

I know this was asked already, but the answers are super unclear.
The first requirement is to open a file (sadly, I have no idea how to do that).
The second requirement is a section of code that does the following:
Each line represents a single student and consists of a student number, a name, a section code and a midterm grade, all separated by whitespace.
So I don't think I can target that element, since the fields are separated by whitespace?
Here is an excerpt of the file, showing line structure
987654322 Xu Carolyn L0101 19.5
233432555 Jones Billy Andrew L5101 16.0
555432345 Patel Amrit L0101 13.5
888332441 Fletcher Bobby L0201 18
777998713 Van Ryan Sarah Jane L5101 20
877633234 Zhang Peter L0102 9.5
543444555 Martin Joseph L0101 15
876543222 Abdolhosseini Mohammad Mazen L0102 18.5
I was provided the following hints:
Notice that the number of names per student varies.
Use rstrip() to get rid of extraneous whitespace at the end of the lines.
I don't understand the second hint.
This is what I have so far:
counter = 0
elements = -1
for sets in the_file
elements = elements + 1
if elements = 3
I know it has something to do with readlines() and targeting the section code.

marks = [float(line.strip().split()[-1]) for line in open('path/to/input/file')]
average = sum(marks)/len(marks)
Hope this helps
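The one-liner above averages every mark in the file. Since the question asks for the average per section, a possible extension is to group the marks by section code first. Because the number of names varies, the sketch below takes the section and the grade from the end of each line; section_averages is a hypothetical name, and io.StringIO stands in for the real file.

```python
import io
from collections import defaultdict

def section_averages(lines):
    """Return {section code: average midterm grade}."""
    grades = defaultdict(list)
    for line in lines:
        fields = line.rstrip().split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        # the last two fields are always the section code and the grade
        section, grade = fields[-2], fields[-1]
        grades[section].append(float(grade))
    return {section: sum(g) / len(g) for section, g in grades.items()}

sample = io.StringIO(
    "987654322 Xu Carolyn L0101 19.5\n"
    "233432555 Jones Billy Andrew L5101 16.0\n"
    "555432345 Patel Amrit L0101 13.5\n"
    "888332441 Fletcher Bobby L0201 18\n"
    "777998713 Van Ryan Sarah Jane L5101 20\n"
    "877633234 Zhang Peter L0102 9.5\n"
    "543444555 Martin Joseph L0101 15\n"
    "876543222 Abdolhosseini Mohammad Mazen L0102 18.5\n"
)
averages = section_averages(sample)
print(averages['L0101'])
```

The average for a single section is then a dictionary lookup, for example averages['L5101'].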

Opening and writing to files
The strip method
Something like this?
data = {}
with open(filename) as f:  # open a file
    for line in f:  # proceed through the file lines
        # split the data on whitespace; split() already skips empty fields
        stData = line.split()
        if not stData:
            continue  # skip blank lines
        # the number of names varies, so take the fixed fields from both ends
        studentN, *names, sectionCode, midtermGrade = stData
        studentName = ' '.join(names)
        if sectionCode not in data:
            data[sectionCode] = []
        # building dict, key is a section code, value is a list of student records
        data[sectionCode].append([studentN, studentName, float(midtermGrade)])
# make calculations: average grade per section
for k, v in data.items():  # items() gives you a (key, value) pair on each iteration
    print('Section:' + k + ' Grade:' + str(sum(x[2] for x in v) / len(v)))

More or less:
score = 0
n = 0
with open('grade_file.txt', 'r') as infile:
    for line in infile:
        # the grade is the last whitespace-separated field on the line
        score += float(line.rstrip().split()[-1])
        n += 1
avg = score / n
