Related
The input is many JSON files differing in structure, and the desired output is a single dataframe.
Input Description:
Each JSON file may have 1 or many attackers and exactly 1 victim. The attackers key points to a list of dictionaries. Each dictionary is 1 attacker with keys such as character_id, corporation_id, alliance_id, etc. The victim key points to dictionary with similar keys. Important thing to note here is that the keys might differ between the same JSON. For example, a JSON file may have attackers key which looks like this:
{
"attackers": [
{
"alliance_id": 99005678,
"character_id": 94336577,
"corporation_id": 98224639,
"damage_done": 3141,
"faction_id": 500003,
"final_blow": true,
"security_status": -9.4,
"ship_type_id": 73796,
"weapon_type_id": 3178
},
{
"damage_done": 1614,
"faction_id": 500003,
"final_blow": false,
"security_status": 0,
"ship_type_id": 32963
}
],
...
Here the JSON file has 2 attackers. But only the first attacker has the afore-mentioned keys. Similarly, the victim may look like this:
...
"victim": {
"character_id": 2119076173,
"corporation_id": 98725195,
"damage_taken": 4755,
"faction_id": 500002,
"items": [...
...
Output Description:
As an output I want to create a dataframe from many (about 400,000) such JSON files stored in the same directory. Each row of the resulting dataframe should have 1 attacker and 1 victim. JSONs with multiple attackers should be split into equal number of rows, where the attackers' properties are different, but the victim properties are the same. For e.g., 3 rows if there are 3 attackers and NaN values where a certain attacker doesn't have a key-value pair. So, the character_id for the second attacker in the dataframe of the above example should be NaN.
Current Method:
To achieve this, I first create an empty list. Then iterate through all the files, open them, load them as JSON objects, convert to dataframe then append dataframe to the list. Please note that pd.DataFrame([json.load(fi)]) has the same output as pd.json_normalize(json.load(fi)).
mainframe = []
for file in tqdm(os.listdir("D:/Master/killmails_jul"), ncols=100, ascii=' >'):
with open("%s/%s" % ("D:/Master/killmails_jul", file),'r') as fi:
mainframe.append(pd.DataFrame([json.load(fi)]))
After this loop, I am left with a list of dataframes which I concatenate using pd.concat().
mainframe = pd.concat(mainframe)
As of yet, the dataframe only has 1 row per JSON irrespective of the number of attackers. To fix this, I use pd.explode() in the next step.
mainframe = mainframe.explode('attackers')
mainframe.reset_index(drop=True, inplace=True)
Now I have separate rows for each attacker, however the attackers & victim keys are still hidden in their respective column. To fix this I 'explode' the the two columns horizontally by pd.apply(pd.Series) and apply prefix for easy recognition as follows:
intframe = mainframe["attackers"].apply(pd.Series).add_prefix("attackers_").join(mainframe["victim"].apply(pd.Series).add_prefix("victim_"))
In the next step I join this intermediate frame with the mainframe to retain the killmail_id and killmail_hash columns. Then remove the attackers & victim columns as I have now expanded them.
mainframe = intframe.join(mainframe)
mainframe.fillna(0, inplace=True)
mainframe.drop(['attackers','victim'], axis=1, inplace=True)
This gives me the desired output with the following 24 columns:
['attackers_character_id', 'attackers_corporation_id', 'attackers_damage_done', 'attackers_final_blow', 'attackers_security_status', 'attackers_ship_type_id', 'attackers_weapon_type_id', 'attackers_faction_id', 'attackers_alliance_id', 'victim_character_id', 'victim_corporation_id', 'victim_damage_taken', 'victim_items', 'victim_position', 'victim_ship_type_id', 'victim_alliance_id', 'victim_faction_id', 'killmail_id', 'killmail_time', 'solar_system_id', 'killmail_hash', 'http_last_modified', 'war_id', 'moon_id']
Question:
Is there a better way to do this than I am doing right now? I tried to use generators but couldn't get them to work. I get an AttributeError: 'str' object has no attribute 'read'
all_files_paths = glob(os.path.join('D:\\Master\\kmrest', '*.json'))
def gen_df(files):
for file in files:
with open(file, 'r'):
data = json.load(file)
data = pd.DataFrame([data])
yield data
mainframe = pd.concat(gen_df(all_files_paths), ignore_index=True)
Will using the pd.concat() function with generators lead to quadratic copying?
Also, I am worried opening and closing many files is slowing down computation. Maybe it would be better to create a JSONL file from all the JSONs first and then creating a dataframe for each line.
If you'd like to get your hands on the files, I am trying to work with you can click here. Let me know if further information is needed.
You could use pd.json_normalize() to help with the heavy lifting:
First, load your data:
import json
import requests
import tarfile
from tqdm.notebook import tqdm
url = 'https://data.everef.net/killmails/2022/killmails-2022-11-22.tar.bz2'
with requests.get(url, stream=True) as r:
fobj = io.BytesIO(r.raw.read())
with tarfile.open(fileobj=fobj, mode='r:bz2') as tar:
json_files = [it for it in tar if it.name.endswith('.json')]
data = [json.load(tar.extractfile(it)) for it in tqdm(json_files)]
To do the same with your files:
import json
from glob import glob
def json_load(filename):
with open(filename) as f:
return json.load(f)
topdir = '...' # the dir containing all your json files
data = [json_load(fn) for fn in tqdm(glob(f'{topdir}/*.json'))]
Once you have a list of dicts in data:
others = ['killmail_id', 'killmail_hash']
a = pd.json_normalize(data, 'attackers', others, record_prefix='attackers.')
v = pd.json_normalize(data).drop('attackers', axis=1)
df = a.merge(v, on=others)
Some quick inspection:
>>> df.shape
(44903, 26)
# check:
>>> sum([len(d['attackers']) for d in data])
44903
>>> df.columns
Index(['attackers.alliance_id', 'attackers.character_id',
'attackers.corporation_id', 'attackers.damage_done',
'attackers.final_blow', 'attackers.security_status',
'attackers.ship_type_id', 'attackers.weapon_type_id',
'attackers.faction_id', 'killmail_id', 'killmail_hash', 'killmail_time',
'solar_system_id', 'http_last_modified', 'victim.alliance_id',
'victim.character_id', 'victim.corporation_id', 'victim.damage_taken',
'victim.items', 'victim.position.x', 'victim.position.y',
'victim.position.z', 'victim.ship_type_id', 'victim.faction_id',
'war_id', 'moon_id'],
dtype='object')
>>> df.iloc[:5, :5]
attackers.alliance_id attackers.character_id attackers.corporation_id attackers.damage_done attackers.final_blow
0 99007887.0 1.450608e+09 2.932806e+08 1426 False
1 99010931.0 1.628193e+09 5.668252e+08 1053 False
2 99007887.0 1.841341e+09 1.552312e+09 1048 False
3 99007887.0 2.118406e+09 9.872458e+07 662 False
4 99005839.0 9.573650e+07 9.947834e+08 630 False
>>> df.iloc[-5:, -5:]
victim.position.z victim.ship_type_id victim.faction_id war_id moon_id
44898 1.558110e+11 670 NaN NaN NaN
44899 -7.678686e+10 670 NaN NaN NaN
44900 -7.678686e+10 670 NaN NaN NaN
44901 -7.678686e+10 670 NaN NaN NaN
44902 -7.678686e+10 670 NaN NaN NaN
Note also that, as desired, missing keys for attackers are NaN:
>>> df.iloc[15:20, :2]
attackers.alliance_id attackers.character_id
15 99007887.0 2.117497e+09
16 99011893.0 1.593514e+09
17 NaN 9.175132e+07
18 NaN 2.119191e+09
19 99011258.0 1.258332e+09
I've been going in circles for days now, and I've run out of steam. Doesn't help that I'm new to python / numpy / pandas etc.
I started with numpy which led me to pandas, because of a GIS function that delivers a numpy array of data. That is my starting point. I'm trying to get to an endpoint being a small enriched dataset, in an excel spreadsheet.
But it seems like going down a rabbit hole trying to extract that data, and then manipulate it with the numpy toolsets. The delivered data is one dimensional, but each row contains 8 fields. A simple conversion to pandas and then to ndarray, magically makes it all good. Except that I lose headers in the process, and it just snowballs from there.
I've had to revaluate my understanding, based on some feedback on another post, and that's fine. But I'm just going in circles. Example after example seems to use predominantly numerical data, and I'm starting to get the feeling that's where it's strength lies. My trying to use it for what I call more of a non-mathematical / numerical purpose looks like I'm barking up the wrong tree.
Any advice?
Addendum
The data I extract from the GIS system is names, dates, other textual data. I then have another csv file that I need to use as a lookup, so that I can enrich the source with more textual information which finally gets published to excel.
SAMPLE DATA - SOURCE
WorkCode Status WorkName StartDate EndDate siteType Supplier
0 AT-W34319 None Second building 2020-05-04 2020-05-31 Type A Acem 1
1 AT-W67713 None Left of the red office tower 2019-02-11 2020-08-28 Type B Quester Q
2 AT-W68713 None 12 main street 2019-05-23 2020-11-03 Class 1 Type B Dettlim Group
3 AT-W70105 None city central 2019-03-07 2021-08-06 Other Hans Int
4 AT-W73855 None top floor 2019-05-06 2020-10-28 Type a None
SAMPLE DATA - CSV
["Id", "Version","Utility/Principal","Principal Contractor Contact"]
XM-N33463,7.1,"A Contracting company", "555-12345"
XM-N33211,2.1,"Contractor #b", "555-12345"
XM-N33225,1.3,"That other contractor", "555-12345"
XM-N58755,1.0,"v Contracting", "555-12345"
XM-N58755,2.3,"dsContracting", "555-12345"
XM-222222,2.3,"dsContracting", "555-12345"
BM-O33343,2.1,"dsContracting", "555-12345"
def SMAN():
####################################################################################################################
# Exporting the results of the analysis...
####################################################################################################################
"""
Approach is as follows:
1) Get the source data
2) Get he CSV lookup data loaded into memory - it'll be faster
3) Iterate through the source data, looking for matches in the CSV data
4) Add an extra couple of columns onto the source data, and populate it with the (matching) lookup data.
5) Export the now enhanced data to excel.
"""
arcpy.env.workspace = workspace + filenameGDB
input = "ApprovedActivityByLocalBoard"
exportFile = arcpy.da.FeatureClassToNumPyArray(input, ['WorkCode', 'Status','WorkName', 'PSN2', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
# we have our data, but it's (9893,) instead of [9893 rows x 8 columns]
pdExportFile = pandas.DataFrame(exportFile)
LBW = pdExportFile.to_numpy()
del exportFile
del pdExportFile
# Now we have [9893 rows x 8 columns] - but we've lost the headers
col_list = ["WorkCode", "Version","Principal","Contact"]
allPermits = pandas.read_csv("lookup.csv", usecols=col_list)
# Now we have the CSV file loaded ... and only the important parts - should be fast.
# Shape: (94523, 4)
# will have to find a way to improve this...
# CSV file has got more than WordCode, because there are different versions (as different records)
# Only want the last one.
# each record must now be "enhanced" with matching record from the CSV file.
finalReport = [] # we are expecting this to be [9893 rows x 12 columns] at the end
counter = -1
for eachWorksite in LBW [:5]: #let's just work with 5 records right now...
counter += 1
# eachWorksite=list(eachWorksite) # eachWorksite is a tuple - so need to convert it
# # but if we change it to a list, we lose the headers!
certID = LBW [counter][0] # get the ID to use for lookup matching
# Search the CSV data
permitsFound = allPermits[allPermits['Id']==certID ]
permitsFound = permitsFound.to_numpy()
if numpy.shape(permitsFound)[0] > 1:
print ("Too many hits!") # got to deal with that CSV Version field.
exit()
else:
# now "enrich" the record/row by adding on the fields from the lookup
# so a row goes from 8 fields to 12 fields
newline = numpy.append (eachWorksite, permitsFound)
# and this enhanced record/row must become the new normal
# but I cannot change the original, so it must go into a new container
finalReport = numpy.append(finalReport, newline, axis = 0)
# now I should have a new container, of "enriched" data
# which as gone from [9893 rows x 8 columns] to [9893 rows x 12 columns]
# Some of the columns of course, could be empty.
#Now let's dump the results to an Excel file and make it accessible for everyone else.
df = pandas.DataFrame (finalReport)
filepath = 'finalreport.csv'
df.to_csv('filepath', index = False)
# Somewhere I was getting Error("Cannot convert {0!r} to Excel".format(value))
# Now I get
filepath = 'finalReport.xlsx'
df.to_excel(filepath, index=False)
I have eventually answered my own question, and this is how:
Yes, for my situation, pandas worked just fine, even beautifully for
manipulating non numerical data. I just had to learn some basics.
The biggest learning was to understand the pandas data frame as an object that has to be manipulated remotely by various functions/tools. Just because I "print" the dataframe, doesn't mean it's just text. (Thanks juanpa.arrivillaga for poitning out my erroneous assumptions) in Why can I not reproduce a nd array manually?
I also had to wrap my mind around the concept of indexes and columns, and how they could be altered/manipulated/ etc. And then, how to use them to maximum effect.
Once those fundamentals had been sorted, the rest followed naturally, and my code reduced to a couple of nice elegant functions.
Cheers
I am trying to import data from a .txt file that contains four columns that are separated by tab and is several thousands lines long. This is how the start of the document look like:
Data info
File name: D:\(path to file)
Start time: 6/26/2019 15:39:54.222
Number of channels: 3
Sample rate: 1E6
Store type: fast on trigger
Post time: 20
Global header information: from DEWESoft
Comments:
Events
Event Type Event Time Comment
1 storing started at 7.237599
2 storing stopped at 7.257599
Data1
Time Incidente Transmitida DI 6
s um/m um/m -
0 2.1690152 140.98599 1
1E-6 2.1690152 140.98599 1
2E-6 4.3380303 145.32402 1
3E-6 4.3380303 145.32402 1
4E-6 -2.1690152 145.32402 1
I have several of these files that I want to loop trough and store in a cell/list that each cell/list item contains the four columns. After that I just use that cell/list to plot the data with a loop.
I saw that pandas library was suitable, but I don't understand how to use it.
fileNames = (["Test1_0001.txt", "Test2_0000.txt", "Test3_0000.txt",
"Test4_0000.txt", "Test5_0000.txt", "Test6_0001.txt", "Test7_0000.txt",
"Test8_0000.txt", "Test9_0000.txt", "Test10_0000.txt", "RawblueMat_0000.txt"])
folderName = 'AuxeticsSHPB\\' #Source folder for all files above
# Loop trough each source document
for i in range(0,len(fileNames)):
print('File location: '+folderName+fileNames[i])
# Get data from source as arrays, cut out the first 20 lines
temp=pd.read_csv(folderName+fileNames[i], sep='\t', lineterminator='\r',
skiprows=[19], error_bad_lines=False)
# Store data in list/cell
# data[i] = temp # sort it
This is something I tried that didn't work, don't really know how to proceed. I know there are some documentation on this problem but I am new to this and need some help.
An error I get when trying the above:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 12, saw 4
So it was an easy fix, just had to remove the braces from skiprows=[19].
The cods now looks like this and works.
fileNames = ["Test1_0001.txt", "Test2_0000.txt", "Test3_0000.txt",
"Test4_0000.txt", "Test5_0000.txt", "Test6_0001.txt", "Test7_0000.txt",
"Test8_0000.txt", "Test9_0000.txt", "Test10_0000.txt", "RawblueMat_0000.txt"]
folderName = 'AuxeticsSHPB\\' #Source folder for all files above
# Preallocation
data = []
for i in range(0,len(fileNames)):
temp=pd.read_csv(folderName+fileNames[i], sep='\t', lineterminator='\r',
skiprows=19)
data.append(temp)
I have a large number of text files with data; each file can be imported into excel separately. However, while most of the columns are the same between the files, in many files there's a column or two added/missing so when I merge all the text files and put it into excel, many columns of data are shifted.
I can make a 'master list' of all the possible data entries, but I'm not exactly sure how to tell excel to put certain types of data in specific columns.
For instance, if I have two files that look like:
Name Year Food Color
Bob 2018 Cake Blue
Charlie 2017 Figs Red
and
LastName Name Age Year Color Size
Lily James 17 2021 green 0
How would I go about merging them like this in excel:
LastName Name Age Year Food Color Size
na Bob na 2018 Cake Blue na
na Charlie na 2017 Figs Red na
Lily James 17 2021 na green 0
Question: Merging inconsistent data in text files into a single excel spreadsheet
This solution is using the following build-in and moudules:
Set Types
Lists
CSV File Reading and Writing
Mapping Types — dict
The core of this solution is to normalize the columns names using a set() object and
the parameter .DictWriter(..., extrasaction='ignore') to handle the inconsistent columns.
The output format is CSV, which can be read from MS-Excel.
The given data, separated by blank
text1 = """Name Year Food Color
Bob 2018 Cake Blue
Charlie 2017 Figs Red
"""
text2 = """LastName Name Age Year Color Size
Lily James 17 2021 green 0
"""
Open three files an get the headers.
Aggregate all columns names, drop double names using a set().
Create a DictReader object for the in_* files.
Note: Replace io.StringIO(... with open(<Path to file>)
with io.StringIO(text1) as in_text1, \
io.StringIO(text2) as in_text2, \
io.StringIO() as out_csv:
columns = set()
reader = []
for n, fh in enumerate([in_text1, in_text2]):
fieldnames = fh.readline().rstrip().split()
[columns.add(name) for name in fieldnames]
reader.append(csv.DictReader(fh, delimiter=' ', fieldnames=fieldnames))
Create a DictWriter object using the normalized column names.
The parameter extrasaction='ignore', handle the inconsistent columns.
Note: The column order is not guaranteed. If you need a defined order, sort the list(columns) to your needs before assigning to fieldnames=.
writer = csv.DictWriter(out_csv, fieldnames=list(columns), , extrasaction='ignore')
writer.writeheader()
Loop all DictReader objects reading all lines and write it to the target .csv file.
for dictReader in reader:
for _dict in dictReader:
writer.writerow(_dict)
Output:
print(out_csv.getvalue())
Color,LastName,Year,Food,Age,Name,Size
Blue,,2018,Cake,,Bob,
Red,,2017,Figs,,Charlie,
green,Lily,2021,,17,James,0
Tested with Python: 3.4.2
If you were happy to work with the text files directly in Excel ... this will work but may need some refinement from yourself.
I understand it’s probably not what you’re looking for but it provides another option.
Open the Visual Basic editor, add a new module and copy the below code and paste in ...
Public Sub ReadAndMergeTextFiles()
Dim strSrcFolder As String, strFileName As String, strLine As String, strPath As String, bFirstLine As Boolean
Dim arrHeaders() As String, lngHeaderIndex As Long, arrFields, i As Long, objDestSheet As Worksheet, bFound As Boolean
Dim objLastHeader As Range, x As Long, lngLastColumn As Long, lngHeaderCol As Long, arrHeaderCols() As Long
Dim lngWriteRow As Long
lngLastColumn = 1
lngWriteRow = 2
Application.EnableEvents = False
Application.ScreenUpdating = False
' Change the sheet name being assigned to your destination worksheet name.
' Alternatively, display a prompt that asks for the sheet or simply uses the active sheet.
Set objDestSheet = Worksheets("Result")
With Application.FileDialog(msoFileDialogFolderPicker)
.Title = "Select Source Folder"
.Show
If .SelectedItems.Count = 1 Then
objDestSheet.Cells.Clear
strSrcFolder = .SelectedItems(1)
strFileName = Dir(strSrcFolder & "\*.txt")
Do While Len(strFileName) > 0
strPath = strSrcFolder & "\" & strFileName
Open strPath For Input As #1
bFirstLine = True
Do Until EOF(1)
Line Input #1, strLine
arrFields = Split(strLine, vbTab, , vbTextCompare)
lngHeaderIndex = -1
For i = 0 To UBound(arrFields)
If bFirstLine Then
' Loop through the header fields already written to the destination worksheet and find a match.
For x = 1 To objDestSheet.Columns.Count
bFound = False
If Trim(objDestSheet.Cells(1, x)) = "" Then Exit For
If UCase(objDestSheet.Cells(1, x)) = UCase(arrFields(i)) Then
lngHeaderCol = x
bFound = True
Exit For
End If
Next
If Not bFound Then
objDestSheet.Cells(1, lngLastColumn) = arrFields(i)
lngHeaderCol = lngLastColumn
lngLastColumn = lngLastColumn + 1
End If
lngHeaderIndex = lngHeaderIndex + 1
ReDim Preserve arrHeaderCols(lngHeaderIndex)
arrHeaderCols(lngHeaderIndex) = lngHeaderCol
Else
' Write out each value into the column found.
objDestSheet.Cells(lngWriteRow, arrHeaderCols(i)) = "'" & arrFields(i)
End If
Next
If Not bFirstLine Then
lngWriteRow = lngWriteRow + 1
End If
bFirstLine = False
Loop
Close #1
strFileName = Dir
Loop
objDestSheet.Columns.AutoFit
End If
End With
Application.ScreenUpdating = True
Application.EnableEvents = True
End Sub
... I did some basic testing with the data you provided and it seemed to work. If for some reason it fails over the data you're using and you can't work it out, let me know and I'll put a fix in.
Some points ...
The order of the columns depends on the order of your files and which columns appear first. Of course, that could be enhanced upon but it is what it is for now.
It assumes all files in the one folder and all files end in .txt
The separator within each file is assumed to be a TAB.
Let me know if that helps.
I got 3 datasets which contain the flow in m3/s per location. Dataset 1 is a 5 year ARI flood, Dataset 2 is a 20 year ARI flood and Dataset 3 is a 50 year ARI flood.
Per location I found the maximum discharge (5,20 & 50)
Code:
for key in Data_5_ARI_RunID_Flow_New.keys():
m = key
y5F_RunID = Data_5_ARI_RunID_Flow_New.loc[:,m]
y20F_RunID = Data_20_ARI_RunID_Flow_New.loc[:,m]
y50F_RunID = Data_50_ARI_RunID_Flow_New.loc[:,m]
max_y5F = max(y5F_RunID)
max_y20F = max(y20F_RunID)
max_y50F = max(y50F_RunID)
Max_DataID = m, max_y5F, max_y20F, max_y50F
print (Max_DataID)
The output is like this:
('G60_18', 44.0514, 47.625, 56.1275)
('Area5_11', 1028.4065, 1191.5946, 1475.9685)
('Area5_12', 1017.8286, 1139.2628, 1424.4304)
('Area5_13', 994.5626, 1220.0084, 1501.1483)
('Area5_14', 995.9636, 1191.8066, 1517.4541)
Now I want to export this result to a csv file, but I don't know how. I used this line of code, but it didn't work:
Max_DataID.to_csv(r'C:\Users\Max_DataID.csv', sep=',', index = False)
Use this file name myexample.csv with specific path where you want to create the file.
Please check that Max_DataID is a iterable value. And as your reference the values are in form of tuple so I use list() to convert tuples into list and that will be supported values for writerow in csv.
import csv
file = open('myexample.csv', 'wb')
filewriter = csv.writer(file,delimiter =',')
for data in Max_DataID:
filewriter.writerow(list(data))
You can do the following.
df.to_csv(file_name, sep='\t')
Also, if you want to split it into chunks, like 10,000 rows, or whatever, you can do this.
import pandas as pd
for i,chunk in enumerate(pd.read_csv('C:/your_path_here/main.csv', chunksize=10000)):
chunk.to_csv('chunk{}.csv'.format(i))