I'm working with a very old program that outputs the results of a batch query in a rather odd format (at least to me).
Imagine having queried info for the objects A, B and C.
The output will look like this:
name : A
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
======
name : B
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
======
name : C
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
Do you have any idea of how to put the data in a more useful format?
A possible good format would be a table with columns A, B, C and rows p1, p2, ...
I had a few ideas but I don't really know how to implement them:
Every object is separated by a ====== string, which means I can use it to split the output into many .txt files.
Then I can read each file with Excel, setting : as the separator, obtaining a CSV file with two columns (one containing the p descriptors and one with the actual values).
Now I need to merge all the CSV files into one single CSV with as many columns as objects and one row per px.
I'd like to do this in Python but I don't know of any package for this situation. Also, there are a few hundred objects, so I need an automated way of doing it.
Any tip, advice or idea you can think of is welcome.
Here's a quick solution that puts the data you say you need - not all the labels - into a CSV file. Each output line starts with the name (A/B/C), followed by the values p1..px.
It has no handling of missing values: if a value is absent, only the present values are listed, so column 5 will not always be p4. It relies on the assumption that a name line starts every item/entry, and that every other a : b line has a value b to be stored. This should be a good starting point if you need to put the data into another structure. The format is truly special, more of a report structure, so I'd guess there's no suitable general-purpose library. Flat-file formats are another similarly tricky old format family for which libraries do exist - I've used one when calculating how much money each Swedish participant in the Interrail programme should receive. Tricky business but fun! :-)
The code:
import re
import csv
with open('input.txt') as f:
    lines = f.readlines()

entries = []
entry = []
for line in lines:
    parts = re.split(r':', line)
    if len(parts) >= 2:
        label = parts[0]
        value = parts[1].strip()
        if label.startswith('name'):
            print('got name: ' + value)
            # start a new entry with the name as its first value
            entry = [value]
            entries.append(entry)
        else:
            print('got value: ' + value)
            entry.append(value)

print('collected {} entries'.format(len(entries)))

with open('output.csv', 'w', newline='') as output:
    wr = csv.writer(output, quoting=csv.QUOTE_ALL)
    wr.writerows(entries)
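To get the table the question asks for (columns A, B, C; rows p1, p2, ...), the collected rows can be transposed before writing. A minimal sketch, assuming every entry has the same number of values; the parameter labels here are hypothetical stand-ins:

```python
import csv

# rows as collected above: the name first, then the values in order
entries = [['A', '11', '12', '23', '24'],
           ['B', '11', '12', '23', '24'],
           ['C', '11', '12', '23', '24']]
params = ['p1', 'p2', 'p3', 'p4']  # hypothetical parameter labels

with open('transposed.csv', 'w', newline='') as f:
    wr = csv.writer(f)
    wr.writerow([''] + [e[0] for e in entries])         # header row: A B C
    for i, p in enumerate(params):
        wr.writerow([p] + [e[i + 1] for e in entries])  # one row per parameter
```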
I have 1000s of text files where I want to replace a very specific section of text with a predefined string. These files contain data like this:
Type Basemap 20221118202211
QSNGAGL1 20221120209912300111111 1B Bus O
QO1290BOB203871145 T1
QI1290BOA0587611451145B T1
QI1290BOB2044911451145B T1
QI1200BOB2014411451145B T1
QI1200BOB2014611451145B T1
QT1200DOY385621145 T1
QSNGAGL2 20221120209912300100110 1B Bus O
QO1290BOB203871145 T1
QI1290BOA0587611451145B T1
QI1200DOY2932411451145B T1
QI1200DOA2517511451145B T1
QT1200DOY385621145 T1
QSNFB 1 20221009209912300101100 1 Bus O
QO1290BOB203871115 T1
QI1290BOA0587611151115B T1
QI1290BOB2044911151115B T1
#(and so on... for ~60,000 rows per file...)
The first row is a header which only appears once per file. The spacing in the data is not consistent. The number of 'non-QS*' rows between each 'QS*' row varies.
I want to be able to:
iterate through each file
find each row starting with 'QS'
find the 2nd section of text in this row (the number usually starting 2022... This is a date range, with 7 numbers on the end representing each 7 days of the week with a 1 or a 0)
replace these last 7 characters of this section with specific text ('1111100')
save this as a new file with the prefix 'fixed_' on the file name (as to not overwrite original file)
I've thought about exploring pandas, but I can't get it to read the data correctly. It doesn't help that from row 55,000 onward (in some files), there appears to be another column of data where a text string has spilled over to the right of its row. I also can't use a simple find-and-replace, as these last 7 values could be any combination of 1s and 0s.
Using the second 'QS' row from the example above, I'd want '20221120209912300100110' changed to '20221120209912301111100'. Note how the last 7 characters are the '1111100' I desire.
UPDATE: I've changed the sample text above to include a differently laid out 'QS*' row which can occur.
Try:
import re

pat = re.compile(r"(^\s*QS.*?)(\d{16})\d{7}\b")  # 16-digit date range, then 7 day flags

with open("input.txt", "r") as f_in, open("fixed_output.txt", "w") as f_out:
    for line in f_in:
        line = pat.sub(r"\g<1>\g<2>1111100", line)
        f_out.write(line)
If input.txt contains the text in the question then fixed_output.txt will contain:
Type Basemap 20221118202211
QSNGAGL1 20221120209912301111100 1B Bus O
QO1290BOB203871145 T1
QI1290BOA0587611451145B T1
QI1290BOB2044911451145B T1
QI1200BOB2014411451145B T1
QI1200BOB2014611451145B T1
QT1200DOY385621145 T1
QSNGAGL2 20221120209912301111100 1B Bus O
QO1290BOB203871145 T1
QI1290BOA0587611451145B T1
QI1200DOY2932411451145B T1
QI1200DOA2517511451145B T1
QT1200DOY385621145 T1
QSNFB 1 20221009209912301111100 1 Bus O
QO1290BOB203871115 T1
QI1290BOA0587611151115B T1
QI1290BOB2044911151115B T1
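The question also asks to iterate over many files and save each result with a fixed_ prefix. A sketch of that loop with pathlib, assuming the .txt files live in a (hypothetical) data/ folder:

```python
import re
from pathlib import Path

pat = re.compile(r"(^\s*QS.*?)(\d{16})\d{7}\b")  # 16-digit date range, then 7 day flags

def fix_file(path):
    """Write a fixed_-prefixed copy of path with the day flags set to 1111100."""
    fixed = path.with_name("fixed_" + path.name)
    with open(path) as f_in, open(fixed, "w") as f_out:
        for line in f_in:
            f_out.write(pat.sub(r"\g<1>\g<2>1111100", line))
    return fixed

for path in Path("data").glob("*.txt"):  # hypothetical data/ folder
    fix_file(path)
```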
I'm trying to read a file into tables I can work with. I have one input file that contains 4 tables with coefficients. Each table begins with a line which describes its contents. Each table contains 25 numbers for each latitude, from -85 to 85, and each month. I would like to split the input file into a matrix like TAB(4,12,18,25) - 4 tables, 12 months, 18 latitudes and 25 levels. It's pretty messy as I don't have a fixed separator - sometimes I could use a space, but later on there are negative values and the space is taken by the minus sign.
O2 CLIMATOLOGY : k*=11, 12, ..., 35
JAN -85 O2 cli 2.452E-07-8.040E-07 8.850E-07 7.970E-07 7.875E-06 8.494E-06\n
5.082E-06 4.159E-06-5.252E-06 5.892E-06 7.188E-06-7.641E-06 5.082E-06 5.350E-06\n
5.380E-06 5.079E-06 4.229E-06-3.367E-06-2.600E-06 2.043E-06-1.706E-06 7.413E-06\n
1.158E-06 9.480E-07 7.570E-07\n
JAN -75 O2 cli 2.300E-07 3.020E-07 4.760E-07 9.210E-07 1.729E-06 2.486E-06\n
3.163E-06 3.668E-06 3.838E-06 3.993E-06 4.401E-06 4.911E-06 5.304E-06 5.506E-06\n
.
.
.
TEMPERATURE CLIMATOLOGY : Z*=11, 12, ..., 35
JAN -85 T clim 2.278E+02 2.303E+02 2.323E+02 2.334E+02 2.340E+02 2.344E+02\n
It is Fortran model output. I tried readlines and split. In tables with a " " delimiter it worked well, but in the others, where the space is taken by the minus sign, it does not.
I am not used to working with such data and have no idea how to proceed.
It looks to me like you have a fixed-length format for your data.
Each section of your file is identified with a title. Within a section, you have a number of subsections. Each subsection conforms to the following format:
Seven characters for your month: JAN, FEB, JUL, etc.
Five characters for your latitude
I don't know what this third field is, or if you want it, but it requires 8 characters.
Each of your 25 values is given a width of 10 characters.
Before we construct an algorithm, let's create a class to hold our data. You might want to use a pandas dataframe for this but I'll use standard Python because that's easier to work with for people who don't already know pandas:
class TableData:
    def __init__(self, name, month, latitude, levels):
        self.name = name
        self.month = month
        self.latitude = latitude
        self.levels = levels
From this we can construct an algorithm:
# First, set up some constants for the data file
num_tables = 4
num_months = 12
num_latitudes = 18
num_levels = 25
month_stop = 7
latitude_stop = 12
col3_stop = 20
level_width = 10
table_size = num_months * num_latitudes

# Helper function to extract levels data from a single line of the table
def get_levels(line, start):
    return [float(line[k:k + level_width]) for k in range(start, len(line), level_width)]

# Next, read in the raw data, separated by newlines
with open('data', 'r') as f:
    lines = f.readlines()

# Remove lines with no data in them
lines = [l for l in lines if l.strip()]

# Now, iterate over the entire file and convert it to the proper format
tables = {}
index = 0
for table in range(num_tables):
    # First, get the name of the table
    name = lines[index].strip()
    # Next, iterate over all the data in the table; each entry spans
    # four lines (6 values on the first line, then 8, 8 and 3)
    data = []
    for i in range(index + 1, index + 1 + 4 * table_size, 4):
        # First, get the month associated with the data
        month = lines[i][:3]
        # Next, get the latitude associated with the data
        latitude = int(lines[i][3:8])
        # Now, get the levels by splitting the remainder of this line, and
        # each of the following three lines, into chunks equivalent to the
        # width of the level column, converting them to floating-point
        # numbers and combining everything into a single list
        levels = []
        for j in range(4):
            levels.extend(get_levels(lines[i + j], col3_stop if j == 0 else 0))
        # Finally, create an instance of our table data and add it to the
        # list of data for the table
        data.append(TableData(name, month, latitude, levels))
    # Finally, associate the data with the name, advance the index past
    # the title line and the table body, and continue on to the next table
    tables[name] = data
    index += 1 + 4 * table_size
This algorithm is pretty brittle as it relies on your file being structured exactly as you described, but it'll do the job if that's the case.
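If the end goal is the TAB(4, 12, 18, 25) array from the question, the parsed values can be packed into a numpy array once they are in a flat, consistently ordered list. A sketch with stand-in data; in practice the values would come from the parsing above, ordered table, then month, then latitude, then level:

```python
import numpy as np

num_tables, num_months, num_latitudes, num_levels = 4, 12, 18, 25

# stand-in data: one float per (table, month, latitude, level), in file order
flat = [float(i) for i in range(num_tables * num_months * num_latitudes * num_levels)]

TAB = np.array(flat).reshape(num_tables, num_months, num_latitudes, num_levels)
print(TAB.shape)  # (4, 12, 18, 25)
```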
I have a large number of text files with data; each file can be imported into excel separately. However, while most of the columns are the same between the files, in many files there's a column or two added/missing so when I merge all the text files and put it into excel, many columns of data are shifted.
I can make a 'master list' of all the possible data entries, but I'm not exactly sure how to tell excel to put certain types of data in specific columns.
For instance, if I have two files that look like:
Name Year Food Color
Bob 2018 Cake Blue
Charlie 2017 Figs Red
and
LastName Name Age Year Color Size
Lily James 17 2021 green 0
How would I go about merging them like this in excel:
LastName Name Age Year Food Color Size
na Bob na 2018 Cake Blue na
na Charlie na 2017 Figs Red na
Lily James 17 2021 na green 0
Question: Merging inconsistent data in text files into a single excel spreadsheet
This solution uses the following built-ins and modules:
Set Types
Lists
CSV File Reading and Writing
Mapping Types — dict
The core of this solution is to normalize the column names using a set() object and
the parameter .DictWriter(..., extrasaction='ignore') to handle the inconsistent columns.
The output format is CSV, which can be read from MS-Excel.
The given data, separated by blank
text1 = """Name Year Food Color
Bob 2018 Cake Blue
Charlie 2017 Figs Red
"""
text2 = """LastName Name Age Year Color Size
Lily James 17 2021 green 0
"""
Open three files and get the headers.
Aggregate all column names, dropping duplicates by using a set().
Create a DictReader object for the in_* files.
Note: Replace io.StringIO(... with open(<Path to file>)
import csv
import io

with io.StringIO(text1) as in_text1, \
     io.StringIO(text2) as in_text2, \
     io.StringIO() as out_csv:

    columns = set()
    reader = []
    for fh in [in_text1, in_text2]:
        fieldnames = fh.readline().rstrip().split()
        columns.update(fieldnames)
        reader.append(csv.DictReader(fh, delimiter=' ', fieldnames=fieldnames))
Create a DictWriter object using the normalized column names.
The parameter extrasaction='ignore', handle the inconsistent columns.
Note: The column order is not guaranteed. If you need a defined order, sort the list(columns) to your needs before assigning to fieldnames=.
    writer = csv.DictWriter(out_csv, fieldnames=list(columns), extrasaction='ignore')
    writer.writeheader()
Loop all DictReader objects reading all lines and write it to the target .csv file.
    for dictReader in reader:
        for _dict in dictReader:
            writer.writerow(_dict)
Output:
    print(out_csv.getvalue())
Color,LastName,Year,Food,Age,Name,Size
Blue,,2018,Cake,,Bob,
Red,,2017,Figs,,Charlie,
green,Lily,2021,,17,James,0
Tested with Python 3.4.2.
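As the note above says, the iteration order of a set() is arbitrary, so the column order can differ between runs; sorting the names gives a stable header. A minimal sketch:

```python
import csv
import io

columns = {'Color', 'LastName', 'Year', 'Food', 'Age', 'Name', 'Size'}

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=sorted(columns), extrasaction='ignore')
writer.writeheader()
# missing keys are filled with the default restval, an empty string
writer.writerow({'Name': 'Bob', 'Year': '2018', 'Food': 'Cake', 'Color': 'Blue'})
print(out.getvalue())
```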
If you're happy to work with the text files directly in Excel ... this will work, but it may need some refinement from yourself.
I understand it's probably not what you're looking for, but it provides another option.
Open the Visual Basic editor, add a new module and copy the below code and paste in ...
Public Sub ReadAndMergeTextFiles()
    Dim strSrcFolder As String, strFileName As String, strLine As String, strPath As String, bFirstLine As Boolean
    Dim arrHeaders() As String, lngHeaderIndex As Long, arrFields, i As Long, objDestSheet As Worksheet, bFound As Boolean
    Dim objLastHeader As Range, x As Long, lngLastColumn As Long, lngHeaderCol As Long, arrHeaderCols() As Long
    Dim lngWriteRow As Long

    lngLastColumn = 1
    lngWriteRow = 2
    Application.EnableEvents = False
    Application.ScreenUpdating = False

    ' Change the sheet name being assigned to your destination worksheet name.
    ' Alternatively, display a prompt that asks for the sheet or simply uses the active sheet.
    Set objDestSheet = Worksheets("Result")

    With Application.FileDialog(msoFileDialogFolderPicker)
        .Title = "Select Source Folder"
        .Show
        If .SelectedItems.Count = 1 Then
            objDestSheet.Cells.Clear
            strSrcFolder = .SelectedItems(1)
            strFileName = Dir(strSrcFolder & "\*.txt")
            Do While Len(strFileName) > 0
                strPath = strSrcFolder & "\" & strFileName
                Open strPath For Input As #1
                bFirstLine = True
                Do Until EOF(1)
                    Line Input #1, strLine
                    arrFields = Split(strLine, vbTab, , vbTextCompare)
                    lngHeaderIndex = -1
                    For i = 0 To UBound(arrFields)
                        If bFirstLine Then
                            ' Loop through the header fields already written to the destination worksheet and find a match.
                            For x = 1 To objDestSheet.Columns.Count
                                bFound = False
                                If Trim(objDestSheet.Cells(1, x)) = "" Then Exit For
                                If UCase(objDestSheet.Cells(1, x)) = UCase(arrFields(i)) Then
                                    lngHeaderCol = x
                                    bFound = True
                                    Exit For
                                End If
                            Next
                            If Not bFound Then
                                objDestSheet.Cells(1, lngLastColumn) = arrFields(i)
                                lngHeaderCol = lngLastColumn
                                lngLastColumn = lngLastColumn + 1
                            End If
                            lngHeaderIndex = lngHeaderIndex + 1
                            ReDim Preserve arrHeaderCols(lngHeaderIndex)
                            arrHeaderCols(lngHeaderIndex) = lngHeaderCol
                        Else
                            ' Write out each value into the column found.
                            objDestSheet.Cells(lngWriteRow, arrHeaderCols(i)) = "'" & arrFields(i)
                        End If
                    Next
                    If Not bFirstLine Then
                        lngWriteRow = lngWriteRow + 1
                    End If
                    bFirstLine = False
                Loop
                Close #1
                strFileName = Dir
            Loop
            objDestSheet.Columns.AutoFit
        End If
    End With

    Application.ScreenUpdating = True
    Application.EnableEvents = True
End Sub
... I did some basic testing with the data you provided and it seemed to work. If for some reason it fails over the data you're using and you can't work it out, let me know and I'll put a fix in.
Some points ...
The order of the columns depends on the order of your files and which columns appear first. Of course, that could be enhanced upon but it is what it is for now.
It assumes all files in the one folder and all files end in .txt
The separator within each file is assumed to be a TAB.
Let me know if that helps.
I have a CSV file that looks like this:
DATE,TEMP
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
...
0101,41.0
0102,39.9
0103,44.6
0104,42.0
0105,43.0
0106,42.4
It's a list of temperatures for specific dates. It contains data for several years so the same dates occur multiple times. I would like to average the temperature so that I get a new table where each date is only occurring once and has the average temperature for that date in the second column.
I know that Stack Overflow requires you to include what you've attempted, but I really don't know how to do this and couldn't find any other answers on this.
I hope someone can help. Any help is much appreciated.
You can use pandas and run the groupby command, where df is your DataFrame:
df.groupby('DATE').mean()
Here is a toy example to illustrate the behaviour:
import pandas as pd
df=pd.DataFrame({"a":[1,2,3,1,2,3],"b":[1,2,3,4,5,6]})
df.groupby('a').mean()
Will result in

     b
a
1  2.5
2  3.5
3  4.5
When the original dataframe was

   a  b
0  1  1
1  2  2
2  3  3
3  1  4
4  2  5
5  3  6
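Applied to the question's data it looks like this (reading from an in-memory string here; with a real file you'd pass the filename to pd.read_csv). dtype=str keeps dates like 0101 from losing their leading zero:

```python
import io
import pandas as pd

csv_text = """DATE,TEMP
0101,39.0
0102,40.9
0101,41.0
0102,39.9
"""

df = pd.read_csv(io.StringIO(csv_text), dtype={'DATE': str})
averages = df.groupby('DATE').mean()  # one row per date, mean TEMP
print(averages)
```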
If you can use defaultdict from the collections module, it makes this type of thing pretty easy.
Assuming your list is in the same directory as the Python script and it looks like this:
list.csv:
DATE,TEMP
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
Here is the code I used to print out the averages.
#test.py
#usage: python test.py list.csv
import sys
from collections import defaultdict

# Open the file named in the first command-line argument
with open(sys.argv[1]) as File:
    # Skip the first line of the file (the "DATE,TEMP" header)
    next(File)
    # Create a dictionary of lists
    ourDict = defaultdict(list)
    # Parse the file, line by line
    for each in File:
        # Split the line on the comma,
        # or whatever separates the fields (Comma Separated Values = CSV)
        each = each.split(',')
        # Now each[0] is a date, and each[1] is a value.
        # We use each[0] as the key and append values to its list
        ourDict[each[0]].append(float(each[1]))

print("Date\tValue")
for key, value in ourDict.items():
    # The average is the sum of all members of the list
    # divided by the list's length
    print(key, '\t', sum(value) / len(value))
I looked around for a while and didn't find anything that matched what I was doing.
I have this code:
import csv
import datetime
legdistrict = []
reader = csv.DictReader(open('active.txt', 'rb'), delimiter='\t')
for row in reader:
    if '27' in row['LegislativeDistrict']:
        legdistrict.append(row)
ages = []
for i, value in enumerate(legdistrict):
    dates = datetime.datetime.now() - datetime.datetime.strptime(value['Birthdate'], '%m/%d/%Y')
    ages.append(int(datetime.timedelta.total_seconds(dates) / 31556952))
total_values = len(ages)
total = sum(ages) / total_values
print total_values
print sum(ages)
print total
which searches a tab-delimited text file and finds the rows in the column named LegislativeDistrict that contain the string 27. (So, finding all rows that are in the 27th LD.) It works well, but I run into issues if the string is a single digit number.
When I run the code with 27, I get this result:
0 ;) eric#crunchbang ~/sbdmn/May 2014 $ python data.py
74741
3613841
48
Which means there are 74,741 values that contain 27, with combined ages of 3,613,841, and an average age of 48.
But when I run the code with 4 I get this result:
0 ;) eric#crunchbang ~/sbdmn/May 2014 $ python data.py
1177818
58234407
49
The first result (1,177,818) is much too large. There are no LDs in my state over 170,000 people, and my lists deal with voters only.
Because of this, I'm assuming using 4 is finding all the values that have 4 in them... so 14, 41, and 24 would all be used thus causing the huge number.
Is there a way I can search for a value in a specific column and use a regex or exact search? Regex works, but I can't get it to search just one column -- it searches the entire text file.
My data looks like this:
StateVoterID CountyVoterID Title FName MName LName NameSuffix Birthdate Gender RegStNum RegStFrac RegStName RegStType RegUnitType RegStPreDirection RegStPostDirection RegUnitNum RegCity RegState RegZipCode CountyCode PrecinctCode PrecinctPart LegislativeDistrict CongressionalDistrict Mail1 Mail2 Mail3 Mail4 MailCity MailZip MailState MailCountry Registrationdate AbsenteeType LastVoted StatusCode
IDNUMBER OTHERIDNUMBER NAME MI 01/01/1900 M 123 FIRST ST W CITY STATE ZIP MM 123 4 AGE 5 01/01/1950 N 01/01/2000 B
'4' in '400' returns True because in does a substring check. Use == instead, which only returns True if the two strings are identical:
if '4' == row['LegislativeDistrict']:
    (...)
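To see the difference with some hypothetical district values:

```python
rows = [{'LegislativeDistrict': '4'},
        {'LegislativeDistrict': '14'},
        {'LegislativeDistrict': '40'}]

substring_hits = [r for r in rows if '4' in r['LegislativeDistrict']]
exact_hits = [r for r in rows if '4' == r['LegislativeDistrict']]

print(len(substring_hits))  # 3 -- '14' and '40' match the substring check too
print(len(exact_hits))      # 1 -- only the exact '4'
```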