Merging inconsistent data in text files into a single excel spreadsheet - python

I have a large number of text files with data; each file can be imported into excel separately. However, while most of the columns are the same between the files, in many files there's a column or two added/missing so when I merge all the text files and put it into excel, many columns of data are shifted.
I can make a 'master list' of all the possible data entries, but I'm not exactly sure how to tell excel to put certain types of data in specific columns.
For instance, if I have two files that look like:
Name Year Food Color
Bob 2018 Cake Blue
Charlie 2017 Figs Red
and
LastName Name Age Year Color Size
Lily James 17 2021 green 0
How would I go about merging them like this in excel:
LastName Name Age Year Food Color Size
na Bob na 2018 Cake Blue na
na Charlie na 2017 Figs Red na
Lily James 17 2021 na green 0

Question: Merging inconsistent data in text files into a single excel spreadsheet
This solution uses the following built-ins and modules:
Set Types
Lists
CSV File Reading and Writing
Mapping Types — dict
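The core of this solution is to normalize the column names using a set() object and
let csv.DictWriter handle the inconsistent columns: fields missing from a row are written as empty strings (the restval default), and the parameter extrasaction='ignore' drops any key that is not in fieldnames.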
The output format is CSV, which can be read from MS-Excel.
The given data, separated by blanks:
text1 = """Name Year Food Color
Bob 2018 Cake Blue
Charlie 2017 Figs Red
"""
text2 = """LastName Name Age Year Color Size
Lily James 17 2021 green 0
"""
Open three files and get the headers.
Aggregate all column names and drop duplicate names using a set().
Create a DictReader object for each of the in_* files.
Note: Replace io.StringIO(...) with open(<Path to file>).
import csv
import io

with io.StringIO(text1) as in_text1, \
     io.StringIO(text2) as in_text2, \
     io.StringIO() as out_csv:

    columns = set()
    reader = []
    for fh in [in_text1, in_text2]:
        # the first line of each file holds its column names
        fieldnames = fh.readline().rstrip().split()
        columns.update(fieldnames)
        reader.append(csv.DictReader(fh, delimiter=' ', fieldnames=fieldnames))

Create a DictWriter object using the normalized column names.
Columns missing from a row are written as empty strings (the restval default); the parameter extrasaction='ignore' only drops keys that are not in fieldnames.
Note: The column order is not guaranteed, because a set is unordered. If you need a defined order, sort the list(columns) to your needs before assigning it to fieldnames=.
    writer = csv.DictWriter(out_csv, fieldnames=list(columns), extrasaction='ignore')
    writer.writeheader()

Loop over all DictReader objects, read all lines and write them to the target .csv file.
    for dictReader in reader:
        for _dict in dictReader:
            writer.writerow(_dict)

Output:
    print(out_csv.getvalue())
Color,LastName,Year,Food,Age,Name,Size
Blue,,2018,Cake,,Bob,
Red,,2017,Figs,,Charlie,
green,Lily,2021,,17,James,0
Tested with Python: 3.4.2
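The question asks for a fixed column layout with na in the gaps. A minimal sketch of that variant follows, assuming first-seen column order is acceptable and that na is the wanted filler. The helper name merge_tables and its fill parameter are hypothetical; restval is the standard csv.DictWriter parameter for missing fields.

```python
import csv
import io

text1 = """Name Year Food Color
Bob 2018 Cake Blue
Charlie 2017 Figs Red
"""
text2 = """LastName Name Age Year Color Size
Lily James 17 2021 green 0
"""

def merge_tables(sources, out, fill='na'):
    """Merge blank-separated tables into one CSV (hypothetical helper)."""
    readers = []
    columns = []  # a list keeps the first-seen order, unlike a set
    for fh in sources:
        fieldnames = fh.readline().split()
        for name in fieldnames:
            if name not in columns:
                columns.append(name)
        readers.append(csv.DictReader(fh, delimiter=' ', fieldnames=fieldnames))
    # restval is written for every column a row does not provide
    writer = csv.DictWriter(out, fieldnames=columns, restval=fill)
    writer.writeheader()
    for reader in readers:
        for row in reader:
            writer.writerow(row)
    return columns

out_csv = io.StringIO()  # replace with open(<Path to file>, 'w', newline='')
columns = merge_tables([io.StringIO(text1), io.StringIO(text2)], out_csv)
print(out_csv.getvalue())
```

To get the exact column order from the question, pass a hand-written master list to fieldnames= instead of the collected one.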

If you were happy to work with the text files directly in Excel ... this will work but may need some refinement from yourself.
I understand it’s probably not what you’re looking for but it provides another option.
Open the Visual Basic editor, add a new module and copy the below code and paste in ...
Public Sub ReadAndMergeTextFiles()
    Dim strSrcFolder As String, strFileName As String, strLine As String, strPath As String, bFirstLine As Boolean
    Dim arrHeaders() As String, lngHeaderIndex As Long, arrFields, i As Long, objDestSheet As Worksheet, bFound As Boolean
    Dim objLastHeader As Range, x As Long, lngLastColumn As Long, lngHeaderCol As Long, arrHeaderCols() As Long
    Dim lngWriteRow As Long

    lngLastColumn = 1
    lngWriteRow = 2

    Application.EnableEvents = False
    Application.ScreenUpdating = False

    ' Change the sheet name being assigned to your destination worksheet name.
    ' Alternatively, display a prompt that asks for the sheet or simply uses the active sheet.
    Set objDestSheet = Worksheets("Result")

    With Application.FileDialog(msoFileDialogFolderPicker)
        .Title = "Select Source Folder"
        .Show

        If .SelectedItems.Count = 1 Then
            objDestSheet.Cells.Clear

            strSrcFolder = .SelectedItems(1)
            strFileName = Dir(strSrcFolder & "\*.txt")

            Do While Len(strFileName) > 0
                strPath = strSrcFolder & "\" & strFileName
                Open strPath For Input As #1
                bFirstLine = True

                Do Until EOF(1)
                    Line Input #1, strLine
                    arrFields = Split(strLine, vbTab, , vbTextCompare)
                    lngHeaderIndex = -1

                    For i = 0 To UBound(arrFields)
                        If bFirstLine Then
                            ' Loop through the header fields already written to the destination worksheet and find a match.
                            For x = 1 To objDestSheet.Columns.Count
                                bFound = False
                                If Trim(objDestSheet.Cells(1, x)) = "" Then Exit For
                                If UCase(objDestSheet.Cells(1, x)) = UCase(arrFields(i)) Then
                                    lngHeaderCol = x
                                    bFound = True
                                    Exit For
                                End If
                            Next

                            If Not bFound Then
                                objDestSheet.Cells(1, lngLastColumn) = arrFields(i)
                                lngHeaderCol = lngLastColumn
                                lngLastColumn = lngLastColumn + 1
                            End If

                            lngHeaderIndex = lngHeaderIndex + 1
                            ReDim Preserve arrHeaderCols(lngHeaderIndex)
                            arrHeaderCols(lngHeaderIndex) = lngHeaderCol
                        Else
                            ' Write out each value into the column found.
                            objDestSheet.Cells(lngWriteRow, arrHeaderCols(i)) = "'" & arrFields(i)
                        End If
                    Next

                    If Not bFirstLine Then
                        lngWriteRow = lngWriteRow + 1
                    End If

                    bFirstLine = False
                Loop

                Close #1
                strFileName = Dir
            Loop

            objDestSheet.Columns.AutoFit
        End If
    End With

    Application.ScreenUpdating = True
    Application.EnableEvents = True
End Sub
... I did some basic testing with the data you provided and it seemed to work. If for some reason it fails over the data you're using and you can't work it out, let me know and I'll put a fix in.
Some points ...
The order of the columns depends on the order of your files and which columns appear first. Of course, that could be enhanced upon but it is what it is for now.
It assumes all files in the one folder and all files end in .txt
The separator within each file is assumed to be a TAB.
Let me know if that helps.

Related

Data Scraping from txt file with consistent structure

I'm working with a very old program that outputs the results for a batch query in a very odd format (at least for me).
Imagine having queried info for the objects A, B and C.
The output will look like this:
name : A
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
======
name : B
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
======
name : C
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
Do you have any idea of how to put the data in a more useful format?
A possible good format would be a table with columns A B C and rows p1, p2...
I had a few ideas, but I don't really know how to implement them:
Every object is separated by a ====== string, which means I can use it to split the output into many .txt files
Then I can read the files with Excel, setting : as the separator, obtaining a csv file with 2 columns (one containing the p descriptors and one with the actual values)
Now I need to merge all the csv files into one single csv with as many columns as objects and px rows
I'd like to do this in Python, but I really don't know any package for this situation. Also, there are a few hundred objects, so I need an automated algorithm for doing that.
Any tip, advice or idea you can think of is welcome.
Here's a quick solution putting the data you say you need - not all labels - in a csv file. Each output line starts with the name A/B/C and then comes the values p1..x.
It has no handling of missing values, so in that case only the values that are present will be listed, and column 5 will not always be p4. It relies on the assumption that a name line starts every item/entry, and that all other a:b lines have a value b to be stored. This should be a good start for putting the data into another structure should you need to. The format is truly special, more of a report structure, so I'd guess there's no suitable general-purpose lib. Flat format is another similarly tricky old format type for which there are libraries - I've used it when calculating how much money each Swedish participant in the Interrail program should receive. Tricky business but fun! :-)
The code:
import re
import csv

with open('input.txt') as f:
    lines = f.readlines()

entries = []
entry = []
for line in lines:
    parts = re.split(r':', line)
    if len(parts) >= 2:
        label = parts[0]
        value = parts[1].strip()
        if label.startswith('name'):
            print('got name: ' + value)
            # start new entry with the name as first value
            entry = [value]
            entries.append(entry)
        else:
            print('got value: ' + value)
            entry.append(value)

print('collected {} entries'.format(len(entries)))

with open('output.csv', 'w', newline='') as output:
    wr = csv.writer(output, quoting=csv.QUOTE_ALL)
    wr.writerows(entries)
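The table layout the question asks for (columns A B C, rows p1, p2...) can be sketched by keying every value on its label, which also copes with missing values. This is only an illustration under assumptions: the function names parse_report and write_pivot are hypothetical, and the sample is a shortened, slightly altered version of the report so that a gap shows up.

```python
import csv
import io

def parse_report(text):
    """Return {object name: {parameter: value}} from the report text."""
    table = {}
    current = None
    for line in text.splitlines():
        if ':' not in line:
            continue  # skips the ------ / ====== separators and "Group N" lines
        label, value = (part.strip() for part in line.split(':', 1))
        if label == 'name':
            current = table.setdefault(value, {})
        elif current is not None:
            current[label] = value
    return table

def write_pivot(table, out):
    """Write one row per parameter and one column per object."""
    names = list(table)
    params = []
    for values in table.values():
        for p in values:
            if p not in params:
                params.append(p)
    writer = csv.writer(out)
    writer.writerow(['parameter'] + names)
    for p in params:
        # an object without this parameter gets an empty cell
        writer.writerow([p] + [table[n].get(p, '') for n in names])

sample = """name : A
------
Group 1
p1 : 11
p2 : 12
======
name : B
------
Group 1
p1 : 21
p3 : 23
"""

table = parse_report(sample)
out = io.StringIO()  # replace with open('output.csv', 'w', newline='')
write_pivot(table, out)
print(out.getvalue())
```

Because values are looked up by label rather than by position, a missing p2 for one object no longer shifts the later values into the wrong row.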

How to detect "strikethrough" style from xlsx file in R

I have to check which data has the "strikethrough" format applied when importing an Excel file in R.
Is there any method to detect it?
Both R and Python approaches are welcome.
R-solution
the tidyxl-package can help you...
Example: temp.xlsx, with data in A1:A4 of the first sheet.
library(tidyxl)
formats <- xlsx_formats( "temp.xlsx" )
cells <- xlsx_cells( "temp.xlsx" )
strike <- which( formats$local$font$strike )
cells[ cells$local_format_id %in% strike, 2 ]
# A tibble: 2 x 1
# address
# <chr>
# 1 A2
# 2 A4
I present below a small sample program that filters out text with strikethrough applied, using the openpyxl package (I tested it on version 2.5.6 with Python 3.7.0). Sorry it took so long to get back to you.
import openpyxl as opx

def ignore_strikethrough(cell):
    # keep only the cells whose font does not have strikethrough applied
    return not cell.font.strike

wb = opx.load_workbook('test.xlsx')
ws = wb.active
colA = ws['A']
fColA = filter(ignore_strikethrough, colA)
for i in fColA:
    print("Cell {0}{1} has value {2}".format(i.column, i.row, i.value))
    print(i.col_idx)
I tested it on a new workbook with the default worksheets, with the letters a,b,c,d,e in the first five rows of column A, where I had applied strikethrough formatting to b and d. This program filters out the cells in columnA which have had strikethrough applied to the font, and then prints the cell, row and values of the remaining ones. The col_idx property returns the 1-based numeric column value.
I found a method below:
from openpyxl import load_workbook

# Assuming rows 1 - 10 of the column have the value A, and the 5th A contains "strikethrough"
TEST_wb = load_workbook(filename = 'TEST.xlsx')
TEST_wb_s = TEST_wb.active
for i in range(1, TEST_wb_s.max_row+1):
    ck_range_A = TEST_wb_s['A'+str(i)]
    if ck_range_A.font.strikethrough == True:
        print('YES')
    else:
        print('NO')
But it doesn't tell the location (in this case, the row numbers), which makes it hard to know where the "strikethrough" cells are when there are a lot of results. How can I collect the result of the statement into a vector?
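One way to get the locations is to collect the row numbers in a list instead of printing YES/NO. The sketch below is hedged: struck_rows is a hypothetical helper, and the demo uses stand-in objects that only mimic the .row and .font.strike attributes of openpyxl cells, so that it runs without a workbook.

```python
from types import SimpleNamespace

def struck_rows(cells):
    """Return the row numbers of all cells whose font has strikethrough set."""
    return [cell.row for cell in cells if cell.font.strike]

# Stand-in cells (an assumption for the demo): rows 1 - 10, only row 5 struck through.
# With openpyxl you would pass a real column instead, e.g. struck_rows(TEST_wb_s['A']).
demo_cells = [
    SimpleNamespace(row=r, font=SimpleNamespace(strike=(r == 5)))
    for r in range(1, 11)
]

rows = struck_rows(demo_cells)
print(rows)
```

The returned list can then be used to index or filter the imported data in one step.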

Exporting max values of different csv files in to one

I got 3 datasets which contain the flow in m3/s per location. Dataset 1 is a 5 year ARI flood, Dataset 2 is a 20 year ARI flood and Dataset 3 is a 50 year ARI flood.
Per location I found the maximum discharge (5,20 & 50)
Code:
for key in Data_5_ARI_RunID_Flow_New.keys():
    m = key
    y5F_RunID = Data_5_ARI_RunID_Flow_New.loc[:,m]
    y20F_RunID = Data_20_ARI_RunID_Flow_New.loc[:,m]
    y50F_RunID = Data_50_ARI_RunID_Flow_New.loc[:,m]
    max_y5F = max(y5F_RunID)
    max_y20F = max(y20F_RunID)
    max_y50F = max(y50F_RunID)
    Max_DataID = m, max_y5F, max_y20F, max_y50F
    print (Max_DataID)
The output is like this:
('G60_18', 44.0514, 47.625, 56.1275)
('Area5_11', 1028.4065, 1191.5946, 1475.9685)
('Area5_12', 1017.8286, 1139.2628, 1424.4304)
('Area5_13', 994.5626, 1220.0084, 1501.1483)
('Area5_14', 995.9636, 1191.8066, 1517.4541)
Now I want to export this result to a csv file, but I don't know how. I used this line of code, but it didn't work:
Max_DataID.to_csv(r'C:\Users\Max_DataID.csv', sep=',', index = False)
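Use the file name myexample.csv together with the specific path where you want to create the file.
Max_DataID is a plain tuple, so it has no to_csv method, and it is overwritten on every pass of your loop. Collect the tuples in a list first (called max_rows below, a name chosen only for this example); list() then converts each tuple into a list, which writerow in csv accepts.
import csv

# max_rows is assumed to hold one (location, max5, max20, max50) tuple per location,
# filled with max_rows.append(Max_DataID) inside the loop from the question.
with open('myexample.csv', 'w', newline='') as file:
    filewriter = csv.writer(file, delimiter=',')
    for data in max_rows:
        filewriter.writerow(list(data))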
You can do the following.
df.to_csv(file_name, sep='\t')
Also, if you want to split it into chunks, like 10,000 rows, or whatever, you can do this.
import pandas as pd
for i,chunk in enumerate(pd.read_csv('C:/your_path_here/main.csv', chunksize=10000)):
    chunk.to_csv('chunk{}.csv'.format(i))
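Putting the pieces together, the per-location maxima can be collected into rows and written in one go. This is a sketch under assumptions: plain dicts of lists stand in for the three DataFrames, the values are made up apart from the maxima quoted in the question, and the header names and the helper max_rows are hypothetical.

```python
import csv
import io

# Stand-in data: each dataset maps a location name to its flow series.
flows_5 = {'G60_18': [40.1, 44.0514], 'Area5_11': [1028.4065, 900.0]}
flows_20 = {'G60_18': [47.625, 30.0], 'Area5_11': [1100.0, 1191.5946]}
flows_50 = {'G60_18': [56.1275, 50.2], 'Area5_11': [1475.9685, 1300.0]}

def max_rows(*datasets):
    """Return one (location, max, max, ...) tuple per location."""
    first = datasets[0]
    return [(key,) + tuple(max(d[key]) for d in datasets) for key in first]

rows = max_rows(flows_5, flows_20, flows_50)

out = io.StringIO()  # replace with open(r'C:\Users\Max_DataID.csv', 'w', newline='')
writer = csv.writer(out)
writer.writerow(['Location', 'Max_5', 'Max_20', 'Max_50'])
writer.writerows(rows)
print(out.getvalue())
```

With real DataFrames, iterating over the column names and calling max on each column fits the same pattern.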

Matching parts of two csv files to return certain elements

Hello, I am looking for some help doing something like an INDEX/MATCH in Excel. I am very new to Python, but my data sets are far too large for Excel now.
I will simplify my question as much as possible, because the data contains a lot of information that is irrelevant to this problem.
CSV A (has 3 Basic columns)
Name, Date, Value
CSV B (has 2 columns)
Value, Score
CSV C (I want to create this using python; 2 columns)
Name, Score
All I want to do is enter a date and have it look up all rows in CSV A which match that "date", then look up in CSV B the "score" associated with the "value" from that row in CSV A, and return it in CSV C along with the name of the person. Rinse and repeat through every row.
Any help is much appreciated; I don't seem to be getting very far.
Here is a working script using Python's csv module:
It prompts the user to input a date (format is m-d-yy), then reads csvA row by row to check if the date in each row matches the inputted date.
If yes, it checks if the value that corresponds the date from the current row of A matches any of the rows in csvB.
If there are matches, it will write the name from csvA and the score from csvB to csvC.
import csv

date = input('Enter date: ').strip()
A = csv.reader( open('csvA.csv', newline=''), delimiter=',')
matches = 0

# reads each row of csvA
for row_of_A in A:
    # removes whitespace before and after of each string in each row of csvA
    row_of_A = [string.strip() for string in row_of_A]
    # if date of row in csvA has equal value to the inputted date
    if row_of_A[1] == date:
        B = csv.reader( open('csvB.csv', newline=''), delimiter=',')
        # reads each row of csvB
        for row_of_B in B:
            # removes whitespace before and after of each string in each row of csvB
            row_of_B = [string.strip() for string in row_of_B]
            # if value of row in csvA is equal to the value of row in csvB
            if row_of_A[2] == row_of_B[0]:
                # if csvC.csv does not exist, create it and write the header
                try:
                    open('csvC.csv', 'r').close()
                except FileNotFoundError:
                    with open('csvC.csv', 'a') as C:
                        print('Name,', 'Score', file=C)
                # writes name from csvA and score from csvB to csvC
                with open('csvC.csv', 'a') as C:
                    print(row_of_A[0] + ', ' + row_of_B[1], file=C)
                # counts the match so the summary below is correct
                matches += 1

m = 'matches' if matches > 1 else 'match'
print('Found', matches, m)
Sample csv files:
csvA.csv
Name, Date, Value
John, 2-6-15, 10
Ray, 3-5-14, 25
Khay, 4-4-12, 30
Jake, 2-6-15, 100
csvB.csv
Value, Score
10, 500
25, 200
30, 300
100, 250
Sample Run:
>>> Enter date: 2-6-15
Found 2 matches
csvC.csv (generated by script)
Name, Score
John, 500
Jake, 250
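For large files, re-reading csvB for every matching row gets slow. A common alternative is to load csvB into a dictionary once and then make a single pass over csvA. The sketch below assumes each value appears only once in csvB; lookup_scores is a hypothetical helper and the in-memory strings stand in for the sample files.

```python
import csv
import io

csv_a = """Name, Date, Value
John, 2-6-15, 10
Ray, 3-5-14, 25
Khay, 4-4-12, 30
Jake, 2-6-15, 100
"""
csv_b = """Value, Score
10, 500
25, 200
30, 300
100, 250
"""

def lookup_scores(a_file, b_file, date):
    """Return (name, score) pairs for all rows of A that match the date."""
    reader_b = csv.reader(b_file)
    next(reader_b)  # skip the header row
    score_by_value = {}
    for row in reader_b:
        if len(row) >= 2:
            score_by_value[row[0].strip()] = row[1].strip()

    reader_a = csv.reader(a_file)
    next(reader_a)  # skip the header row
    result = []
    for row in reader_a:
        if len(row) < 3:
            continue  # ignore blank or incomplete lines
        name, row_date, value = (field.strip() for field in row[:3])
        if row_date == date and value in score_by_value:
            result.append((name, score_by_value[value]))
    return result

# replace io.StringIO(...) with open('csvA.csv', newline='') etc. for real files
pairs = lookup_scores(io.StringIO(csv_a), io.StringIO(csv_b), '2-6-15')
print(pairs)
```

The pairs can then be written to csvC.csv with csv.writer and writerows.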
If you are using Unix, you can do this with the shell script below.
I am also assuming that you are appending the search output to file_C and that there are no duplicates in either source file.
while true
do
    echo "enter date ..."
    read date
    # name and value of the row in file_A that matches the date
    value_one=$(grep "$date" file_A | cut -d',' -f1)
    tmp1=$(grep "$date" file_A | cut -d',' -f3)
    # score of the row in file_B that matches the value
    value_two=$(grep "$tmp1" file_B | cut -d',' -f2)
    echo "${value_one},${value_two}" >> file_C
    echo "want to search more dates ... press y|Y, press any other key to exit"
    read ch
    if [ "$ch" = "y" ] || [ "$ch" = "Y" ]
    then
        continue
    else
        exit
    fi
done

Exact match in Python CSV row and column

I looked around for a while and didn't find anything that matched what I was doing.
I have this code:
import csv
import datetime

legdistrict = []
reader = csv.DictReader(open('active.txt', 'rb'), delimiter='\t')
for row in reader:
    if '27' in row['LegislativeDistrict']:
        legdistrict.append(row)

ages = []
for i,value in enumerate(legdistrict):
    dates = datetime.datetime.now() - datetime.datetime.strptime(value['Birthdate'], '%m/%d/%Y')
    ages.append(int(datetime.timedelta.total_seconds(dates) / 31556952))

total_values = len(ages)
total = sum(ages) / total_values
print total_values
print sum(ages)
print total
which searches a tab-delimited text file and finds the rows in the column named LegislativeDistrict that contain the string 27. (So, finding all rows that are in the 27th LD.) It works well, but I run into issues if the string is a single digit number.
When I run the code with 27, I get this result:
0 ;) eric#crunchbang ~/sbdmn/May 2014 $ python data.py
74741
3613841
48
Which means there are 74,741 values that contain 27, with combined ages of 3,613,841, and an average age of 48.
But when I run the code with 4 I get this result:
0 ;) eric#crunchbang ~/sbdmn/May 2014 $ python data.py
1177818
58234407
49
The first result (1,177,818) is much too large. There are no LDs in my state over 170,000 people, and my lists deal with voters only.
Because of this, I'm assuming using 4 is finding all the values that have 4 in them... so 14, 41, and 24 would all be used thus causing the huge number.
Is there a way I can search for a value in a specific column and use a regex or exact search? Regex works, but I can't get it to search just one column -- it searches the entire text file.
My data looks like this:
StateVoterID CountyVoterID Title FName MName LName NameSuffix Birthdate Gender RegStNum RegStFrac RegStName RegStType RegUnitType RegStPreDirection RegStPostDirection RegUnitNum RegCity RegState RegZipCode CountyCode PrecinctCode PrecinctPart LegislativeDistrict CongressionalDistrict Mail1 Mail2 Mail3 Mail4 MailCity MailZip MailState MailCountry Registrationdate AbsenteeType LastVoted StatusCode
IDNUMBER OTHERIDNUMBER NAME MI 01/01/1900 M 123 FIRST ST W CITY STATE ZIP MM 123 4 AGE 5 01/01/1950 N 01/01/2000 B
'4' in '400' returns True, because the in operator does a substring check on strings. Use '4' == '400' instead, which only returns True if the two strings are identical:
if '4' == row['LegislativeDistrict']:
    (...)
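A small self-contained sketch of the exact comparison on one column is shown below. The sample rows are invented, rows_in_district is a hypothetical helper, and stripping whitespace before comparing is an assumption about how the real file might be padded.

```python
import csv
import io

# Invented tab-delimited sample with only two of the real columns.
sample = "FName\tLegislativeDistrict\nAnn\t4\nBob\t14\nCid\t41\nDee\t 4 \n"

def rows_in_district(fh, district):
    """Return the rows whose LegislativeDistrict equals district exactly."""
    reader = csv.DictReader(fh, delimiter='\t')
    return [row for row in reader
            if row['LegislativeDistrict'].strip() == district]

# replace io.StringIO(sample) with open('active.txt', newline='') for the real file
matches = rows_in_district(io.StringIO(sample), '4')
print([row['FName'] for row in matches])
```

Districts 14 and 41 are no longer picked up, because the whole field has to equal the search string.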
