I have a text file pipe delimited as below. In that file for same ID, CODE and NUM combination we can have different INC and INC_DESC
ID|CODE|NUM|INC|INC_DESC
"F1"|"W1"|1|1001|"INC1001"
"F1"|"W1"|1|1002|"INC1002"
"F1"|"W1"|1|1003|"INC1003"
"F2"|"W1"|1|1002|"INC1003"
"F2"|"W1"|1|1003|"INC1004"
"F2"|"W2"|1|1003|"INC1003"
We want to create json like below where different INC and INC_DESC should come as an array for same combination of ID, CODE and NUM
{"ID":"F1","CODE":"W1","NUM":1,"INC_DTL":[{"INC":1001, "INC_DESC":"INC1001"},{"INC":1002, "INC_DESC":"INC1002"},{"INC":1003, "INC_DESC":"INC1003"}]}
{"ID":"F2","CODE":"W1","NUM":1,"INC_DTL":[{"INC":1002, "INC_DESC":"INC1002"},{"INC":1003, "INC_DESC":"INC1003"}]}
{"ID":"F2","CODE":"W2","NUM":1,"INC_DTL":[{"INC":1003, "INC_DESC":"INC1003"}]}
I tried below but it is not generating nested as I want
import pandas as pd
Input_File = r'V:\input.dat'
df = pd.read_csv(Input_File, sep='|')
json_output = r'V:\outfile.json'
df.to_json(json_output, orient='records')
import pandas as pd
# agg function: wrap each group's values in a list
def agg_that(x):
    return [x]

Input_File = r'V:\input.dat'
df = pd.read_csv(Input_File, sep='|')

# group by the key columns; INC and INC_DESC become list-wrapped Series
df = df.groupby(['ID', 'CODE', 'NUM']).agg(agg_that).reset_index()

# build the nested INC_DTL records from the paired values
df['INC_DTL'] = df.apply(
    lambda x: [{'INC': inc, 'INC_DESC': dsc}
               for inc, dsc in zip(x['INC'][0], x['INC_DESC'][0])],
    axis=1)

# drop the old columns
df.drop(['INC', 'INC_DESC'], axis=1, inplace=True)

json_output = r'V:\outfile.json'
df.to_json(json_output, orient='records', lines=True)
OUTPUT:
{"ID":"F1","CODE":"W1","NUM":1,"INC_DTL":[{"INC":1001,"INC_DESC":"INC1001"},{"INC":1002,"INC_DESC":"INC1002"},{"INC":1003,"INC_DESC":"INC1003"}]}
{"ID":"F2","CODE":"W1","NUM":1,"INC_DTL":[{"INC":1002,"INC_DESC":"INC1003"},{"INC":1003,"INC_DESC":"INC1004"}]}
{"ID":"F2","CODE":"W2","NUM":1,"INC_DTL":[{"INC":1003,"INC_DESC":"INC1003"}]}
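For reference, the same nesting can be produced without a custom aggregation function by letting groupby.apply build the record list per group. This is a sketch on inline sample data standing in for the file:

```python
import pandas as pd

# Inline sample standing in for the pipe-delimited file
df = pd.DataFrame({
    'ID':       ['F1', 'F1', 'F2'],
    'CODE':     ['W1', 'W1', 'W1'],
    'NUM':      [1, 1, 1],
    'INC':      [1001, 1002, 1002],
    'INC_DESC': ['INC1001', 'INC1002', 'INC1003'],
})

# Each group's INC/INC_DESC pairs become one list of dicts
nested = (
    df.groupby(['ID', 'CODE', 'NUM'])[['INC', 'INC_DESC']]
      .apply(lambda g: g.to_dict('records'))
      .rename('INC_DTL')
      .reset_index()
)
print(nested.to_json(orient='records', lines=True))
```

Writing the result with `orient='records', lines=True` gives the same one-JSON-object-per-line layout as the accepted approach.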
Related
I want to:
1. Read a file into a dataframe
2. Do some data manipulation, etc.
3. Copy one column from the dataframe
4. Append that column to a second dataframe
5. Repeat 1-4 until all files are read
My implementation is:
all_data = [[]]  # list to store each set of values
for i in file_list:
    filepath = path + i
    df = pd.read_csv(filepath, sep='\t', header=None, names=colsList)
    # various data manipulation, melt, etc, etc, etc.
    all_data.append(df['value'])
df_all = pd.DataFrame(all_data)
df_all = df_all.T  # Transpose
df_all.set_axis(name_list, axis=1, inplace=True)  # fix the column names
How could this have been better implemented?
Problems:
the data in the python list is transposed (appended by rows not columns)
I couldn't find a way to append by columns or transpose the list (with python list or with pandas) that would work without an error :(
Thanks in advance...
If you keep the data in a dictionary, then you get columns.
But every column needs a unique name - e.g. col1, col2, etc.
import pandas as pd
all_data = {}
all_data['col1'] = [1,2,3]
all_data['col2'] = [4,5,6]
all_data['col3'] = [7,8,9]
new_df = pd.DataFrame(all_data)
print(new_df)
Result:
col1 col2 col3
0 1 4 7
1 2 5 8
2 3 6 9
The same with a for-loop.
I use io.StringIO only to simulate files in memory - you should use the path to the file directly.
import pandas as pd
import io
file_data = {
'file1.csv': '1\t101\n2\t102\n3\t103',
'file2.csv': '4\t201\n5\t202\n6\t202',
'file3.csv': '7\t301\n8\t301\n9\t201',
}
file_list = [
'file1.csv',
'file2.csv',
'file3.csv',
]
# ---
all_data = {}
for number, i in enumerate(file_list, 1):
    df = pd.read_csv(io.StringIO(file_data[i]), sep='\t', header=None, names=['value', 'other'])
    all_data[f'col{number}'] = df['value']
new_df = pd.DataFrame(all_data)
print(new_df)
You can also assign a new column directly:
new_df['col1'] = old_df['value']
import pandas as pd
import io
file_data = {
'file1.csv': '1\t101\n2\t102\n3\t103',
'file2.csv': '4\t201\n5\t202\n6\t202',
'file3.csv': '7\t301\n8\t301\n9\t201',
}
file_list = [
'file1.csv',
'file2.csv',
'file3.csv',
]
# ---
new_df = pd.DataFrame()
for number, i in enumerate(file_list, 1):
    df = pd.read_csv(io.StringIO(file_data[i]), sep='\t', header=None, names=['value', 'other'])
    new_df[f'col{number}'] = df['value']
print(new_df)
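Another option, under the same simulated files, is to collect each 'value' Series in a list and join them column-wise with pd.concat(axis=1):

```python
import io
import pandas as pd

file_data = {
    'file1.csv': '1\t101\n2\t102\n3\t103',
    'file2.csv': '4\t201\n5\t202\n6\t202',
    'file3.csv': '7\t301\n8\t301\n9\t201',
}
file_list = ['file1.csv', 'file2.csv', 'file3.csv']

# Collect each file's 'value' column, renamed so the column names stay unique
columns = []
for number, i in enumerate(file_list, 1):
    df = pd.read_csv(io.StringIO(file_data[i]), sep='\t',
                     header=None, names=['value', 'other'])
    columns.append(df['value'].rename(f'col{number}'))

# Join the Series side by side (aligned on their row index)
new_df = pd.concat(columns, axis=1)
print(new_df)
```

This avoids the transpose step entirely, because pd.concat with axis=1 appends by columns rather than rows.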
I have 5 sheets and created a script to do numerous formatting steps. I tested it per sheet, and it works perfectly.
import numpy as np
import pandas as pd
FileLoc = r'C:\T.xlsx'
Sheets = ['Alex','Elvin','Gerwin','Jeff','Joshua',]
df = pd.read_excel(FileLoc, sheet_name= 'Alex', skiprows=6)
df = df[df['ENDING'] != 0]
df = df.head(30).T
df = df[~df.index.isin(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'ENDING', 3])]
df.index.rename('STORE', inplace=True)
df['index'] = df.index
df2 = df.melt(id_vars=['index', 2, 0, 1], value_name='SKU')
df2 = df2[df2['variable']!= 3]
df2['SKU2'] = np.where(df2['SKU'].astype(str).fillna('0').str.contains('ALF|NOB|MET'),df2.SKU, None)
df2['SKU2'] = df2['SKU2'].ffill()
df2 = df2[~df2[0].isnull()]
df2 = df2[df2['SKU'] != 0]
df2[1] = pd.to_datetime(df2[1]).dt.date
df2.to_excel(r'C:\test.xlsx', index=False)
but when I assigned the list via sheet_name=Sheets, it always produced the error KeyError: 'ENDING'. This is the part of the code:
Sheets = ['Alex','Elvin','Gerwin','Jeff','Joshua',]
df = pd.read_excel(FileLoc,sheet_name='Sheets',skiprows=6)
Is there a proper way to do this, like looping?
My expected result is to execute the formatting that I have created and consolidate it into one excel file.
NOTE: All sheets have the same format.
When using the read_excel method, if you give the parameter sheet_name=None, it returns a dict (an OrderedDict in older pandas versions) with the sheet names as keys and the corresponding DataFrames as values. So you can apply your logic to each sheet by looping through the dictionary with .items().
The code would look something like this,
dfs = pd.read_excel('your-excel.xlsx', sheet_name=None)
for key, value in dfs.items():
    # apply logic to value
If you wish to combine the data in the sheets, you can collect the per-sheet DataFrames in a list and concatenate them with pd.concat (DataFrame.append is deprecated in recent pandas). We combine the data after the logic has been applied to each sheet. That would look something like this,
dfs = pd.read_excel('your-excel.xlsx', sheet_name=None)
frames = []
for key, value in dfs.items():
    # apply logic to value, which is a DataFrame
    frames.append(value)
combined_df = pd.concat(frames, ignore_index=True)
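To see the whole pattern run end to end without an actual workbook, here is a sketch where a plain dict stands in for what pd.read_excel(..., sheet_name=None) returns (sheet name mapped to DataFrame); the ENDING filter is the only per-sheet step kept from the original script, and the STORE column is an illustrative addition:

```python
import pandas as pd

# Stand-in for pd.read_excel(FileLoc, sheet_name=None):
# a dict mapping sheet name -> DataFrame
dfs = {
    'Alex':  pd.DataFrame({'ENDING': [5, 0], 'SKU': ['ALF1', 'NOB2']}),
    'Elvin': pd.DataFrame({'ENDING': [3, 7], 'SKU': ['MET3', 'ALF4']}),
}

frames = []
for sheet_name, sheet_df in dfs.items():
    sheet_df = sheet_df[sheet_df['ENDING'] != 0]   # per-sheet formatting step
    sheet_df = sheet_df.assign(STORE=sheet_name)   # remember the source sheet
    frames.append(sheet_df)

# One consolidated frame, ready for a single to_excel call
combined_df = pd.concat(frames, ignore_index=True)
print(combined_df)
```

The same loop body can hold the full formatting script from the question, since all sheets share one format.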
I have been trying to normalize this JSON data for quite some time now, but I am getting stuck at a very basic step. I think the answer might be quite simple. I will take any help provided.
import json
import urllib.request
import pandas as pd
url = "https://www.recreation.gov/api/camps/availability/campground/232447/month?start_date=2021-05-01T00%3A00%3A00.000Z"
with urllib.request.urlopen(url) as url:
    data = json.loads(url.read().decode())
#data = json.dumps(data, indent=4)
df = pd.json_normalize(data = data['campsites'], record_path= 'availabilities', meta = 'campsites')
print(df)
My expected df result is as follows:
One approach (not using pd.json_normalize) is to iterate through a list of the unique campsites and convert the data for each campsite to a DataFrame. The list of campsite-specific DataFrames can then be concatenated using pd.concat.
Specifically:
## generate a list of unique campsites
unique_campsites = [item for item in data['campsites'].keys()]
## function that returns a DataFrame for each campsite,
## renaming the index to 'date'
def campsite_to_df(data, campsite):
    out_df = pd.DataFrame(data['campsites'][campsite]).reset_index()
    out_df = out_df.rename({'index': 'date'}, axis=1)
    return out_df
## generate a list of DataFrames, one per campsite
df_list = [campsite_to_df(data, cs) for cs in unique_campsites]
## concatenate the list of DataFrames into a single DataFrame,
## convert campsite id to integer and sort by campsite + date
df_full = pd.concat(df_list)
df_full['campsite_id'] = df_full['campsite_id'].astype(int)
df_full = df_full.sort_values(by=['campsite_id', 'date'], ascending=True)
## remove extraneous columns and rename campsite_id to campsites
df_full = df_full[['campsite_id', 'date', 'availabilities',
                   'max_num_people', 'min_num_people', 'type_of_use']]
df_full = df_full.rename({'campsite_id': 'campsites'}, axis = 1)
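Since the live API response may change, here is the same pipeline run against a small mock payload; the field names mirror the real response, but the values are made up:

```python
import pandas as pd

# Mock stand-in for the API response's 'campsites' section
data = {'campsites': {
    '102': {'campsite_id': '102', 'type_of_use': 'Overnight',
            'max_num_people': 8, 'min_num_people': 1,
            'availabilities': {'2021-05-01T00:00:00Z': 'Reserved',
                               '2021-05-02T00:00:00Z': 'Available'}},
    '101': {'campsite_id': '101', 'type_of_use': 'Overnight',
            'max_num_people': 6, 'min_num_people': 1,
            'availabilities': {'2021-05-01T00:00:00Z': 'Available'}},
}}

def campsite_to_df(data, campsite):
    # The availabilities dict supplies the index (one row per date);
    # the scalar fields broadcast across those rows
    out_df = pd.DataFrame(data['campsites'][campsite]).reset_index()
    return out_df.rename({'index': 'date'}, axis=1)

df_full = pd.concat([campsite_to_df(data, cs) for cs in data['campsites']])
df_full['campsite_id'] = df_full['campsite_id'].astype(int)
df_full = df_full.sort_values(by=['campsite_id', 'date'])
print(df_full)
```

The result has one row per campsite per date, sorted by campsite and date, matching the shape the answer produces from the real API.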
I have a list of string values that I read from a text document with splitlines, which yields something like this:
X = ["NAME|Contact|Education","SMITH|12345|Graduate","NITA|11111|Diploma"]
I have tried this:
for i in X:
    textnew = i.split("|")
    data[x] = textnew
I want to make a dataframe out of this
Name Contact Education
SMITH 12345 Graduate
NITA 11111 Diploma
You can read it directly from your file by specifying a sep argument to pd.read_csv.
df = pd.read_csv("/path/to/file", sep='|')
Or if you wish to convert it from the list of strings instead:
data = [row.split('|') for row in X]
headers = data.pop(0) # Pop the first element since it's header
df = pd.DataFrame(data, columns=headers)
You had it almost correct, actually, but don't use data as a dictionary (i.e. assigning by key - data[x] = textnew):
X = ["NAME|Contact|Education","SMITH|12345|Graduate","NITA|11111|Diploma"]
df = []
for i in X:
    df.append(i.split("|"))
print(df)
# [['NAME', 'Contact', 'Education'], ['SMITH', '12345', 'Graduate'], ['NITA', '11111', 'Diploma']]
Depending on further transformations, pandas might be overkill for this kind of task.
Here is a solution for your problem:
import pandas as pd

X = ["NAME|Contact|Education", "SMITH|12345|Graduate", "NITA|11111|Diploma"]
data = []
for i in X:
    data.append(i.split("|"))
headers = data.pop(0)  # first row holds the column names
df = pd.DataFrame(data, columns=headers)
In your situation, you can avoid loading the file with readlines and let pandas take care of loading the file.
As mentioned above, the solution is a standard read_csv:
import os
import pandas as pd
path = "/tmp"
filepath = "file.xls"
filename = os.path.join(path,filepath)
df = pd.read_csv(filename, sep='|')
print(df.head())
Another approach (in situations where you have no access to the file, or you have to deal with a list of strings) is to wrap the list of strings as an in-memory text file, then load it normally using pandas:
import pandas as pd
from io import StringIO
X = ["NAME|Contact|Education", "SMITH|12345|Graduate", "NITA|11111|Diploma"]
# Join the strings with newlines and wrap them as an in-memory file
DATA = StringIO("\n".join(X))
# Load as a pandas dataframe
df = pd.read_csv(DATA, delimiter="|")
I'm new to pandas, and need to prepare a table using pandas, imitating the exact function performed by the following code snippet:
with open(r'D:/DataScience/ml-100k/u.item') as f:
    temp = ''
    for line in f:
        fields = line.rstrip('\n').split('|')
        movieId = int(fields[0])
        name = fields[1]
        geners = fields[5:25]
        geners = map(int, geners)
My question is how to add a geners column in pandas, doing the same as:
geners = fields[5:25]
It's not clear to me what you intend to accomplish -- a single genres column containing fields 5-24 concatenated? Or separate genre columns for fields 5-24?
For the latter, you can use [pandas.read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html):
import pandas as pd
cols = ['movieId', 'name'] + ['genre_' + str(i) for i in range(5, 25)]
df = pd.read_csv(r'D:/DataScience/ml-100k/u.item', delimiter='|', names=cols)
For the former, you could concatenate the genres into say, a space-separated list, using:
df['genres'] = df[cols[2:]].astype(str).apply(' '.join, axis=1)  # astype(str): the flags parse as ints
df.drop(cols[2:], axis=1, inplace=True)  # drop the separate genre_N columns
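Here is a self-contained sketch combining both snippets on two mock rows (the real u.item file is not available here, so a movieId|name|<20 genre flags> layout is assumed, and the flag values are made up):

```python
import io
import pandas as pd

# Two mock rows: movieId|name|<20 genre flags>
row1 = "1|Toy Story (1995)|" + "|".join("1" if i in (0, 3) else "0" for i in range(20))
row2 = "2|GoldenEye (1995)|" + "|".join("1" if i == 1 else "0" for i in range(20))

cols = ['movieId', 'name'] + ['genre_' + str(i) for i in range(5, 25)]
df = pd.read_csv(io.StringIO(row1 + "\n" + row2), delimiter='|', names=cols)

# Concatenate the 20 flag columns into one space-separated string;
# astype(str) is needed because read_csv parses the flags as integers
df['genres'] = df[cols[2:]].astype(str).apply(' '.join, axis=1)
df = df.drop(cols[2:], axis=1)
print(df)
```

The resulting frame has just movieId, name, and a single genres string per movie, mirroring what the loop in the question collects.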