Read multiple yaml files to pandas Dataframe - python

I do realize this has already been addressed here (e.g., Reading csv zipped files in python, How can I parse a YAML file in Python, Retrieving data from a yaml file based on a Python list). Nevertheless, I hope this question is different.
I know how to load a single YAML file into a pandas DataFrame:
import yaml
import pandas as pd
with open(r'1000851.yaml') as file:
    df = pd.io.json.json_normalize(yaml.load(file))
df.head()
I would like to read several yaml files from a directory into pandas dataframe and concatenate them into one big DataFrame. I have not been able to figure it out though...
import pandas as pd
import yaml
import glob
path = r'../input/cricsheet-a-retrosheet-for-cricket/all' # use your path
all_files = glob.glob(path + "/*.yaml")
li = []
for filename in all_files:
    df = pd.json_normalize(yaml.load(filename, Loader=yaml.FullLoader))
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
Error
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<timed exec> in <module>
/opt/conda/lib/python3.7/site-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
268
269 if record_path is None:
--> 270 if any([isinstance(x, dict) for x in y.values()] for y in data):
271 # naive normalization, this is idempotent for flat records
272 # and potentially will inflate the data considerably for
/opt/conda/lib/python3.7/site-packages/pandas/io/json/_normalize.py in <genexpr>(.0)
268
269 if record_path is None:
--> 270 if any([isinstance(x, dict) for x in y.values()] for y in data):
271 # naive normalization, this is idempotent for flat records
272 # and potentially will inflate the data considerably for
AttributeError: 'str' object has no attribute 'values'
Sample Dataset Zipped
Sample Dataset
Is there a way to do this and read files efficiently?

It seems the first part of your code and the second part you added are different.
The first part correctly reads a YAML file, but the second part is broken:
for filename in all_files:
    # `filename` here is just a string containing the name of the file.
    df = pd.json_normalize(yaml.load(filename, Loader=yaml.FullLoader))
    li.append(df)
The problem is that you need to read the files: currently you're passing the filename to yaml.load, not the file content. Do this instead:
li = []
# Only loading 3 files:
for filename in all_files[:3]:
    with open(filename, 'r') as fh:
        df = pd.json_normalize(yaml.safe_load(fh.read()))
    li.append(df)
len(li)
3
pd.concat(li)
output:
innings meta.data_version meta.created meta.revision info.city info.competition ... info.player_of_match info.teams info.toss.decision info.toss.winner info.umpires info.venue
0 [{'1st innings': {'team': 'Glamorgan', 'delive... 0.9 2020-09-01 1 Bristol Vitality Blast ... [AG Salter] [Glamorgan, Gloucestershire] field Gloucestershire [JH Evans, ID Blackwell] County Ground
0 [{'1st innings': {'team': 'Pune Warriors', 'de... 0.9 2013-05-19 1 Pune IPL ... [LJ Wright] [Pune Warriors, Delhi Daredevils] bat Pune Warriors [NJ Llong, SJA Taufel] Subrata Roy Sahara Stadium
0 [{'1st innings': {'team': 'Botswana', 'deliver... 0.9 2020-08-29 1 Gaborone NaN ... [A Rangaswamy] [Botswana, St Helena] bat Botswana [R D'Mello, C Thorburn] Botswana Cricket Association Oval 1
[3 rows x 18 columns]
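As a follow-up, here is a minimal sketch for loading every file in the directory at once (assuming all the YAML files share the same structure and fit in memory):
import glob

import pandas as pd
import yaml

path = r'../input/cricsheet-a-retrosheet-for-cricket/all'
all_files = glob.glob(path + "/*.yaml")

frames = []
for filename in all_files:
    with open(filename, 'r') as fh:
        # safe_load is enough for plain data files and avoids constructing arbitrary objects
        frames.append(pd.json_normalize(yaml.safe_load(fh)))

frame = pd.concat(frames, axis=0, ignore_index=True)
Collecting the per-file frames in a list and calling pd.concat once at the end is considerably faster than appending to a growing DataFrame inside the loop.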

Related

Flatten nested JSON into pandas dataframe columns

I have a pandas column whose cells contain nested JSON data as strings. I'd like to flatten the data into multiple pandas columns.
Here's data from a single cell:
rent['ques'][9] = "{'Rent': [{'Name': 'Asking', 'Value': 16.07, 'Unit': 'Usd'}], 'Vacancy': {'Name': 'Vacancy', 'Value': 25.34100001, 'Unit': 'Pct'}}"
For each cell in the pandas column, I'd like to parse this string and create multiple columns. Expected output looks something like this:
When I run json_normalize(rent['ques']), I receive the following error.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-28-cebc86357f34> in <module>()
----> 1 json_normalize(rentoff['Survey'])
/anaconda3/lib/python3.7/site-packages/pandas/io/json/normalize.py in json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep)
196 if record_path is None:
197 if any([[isinstance(x, dict)
--> 198 for x in compat.itervalues(y)] for y in data]):
199 # naive normalization, this is idempotent for flat records
200 # and potentially will inflate the data considerably for
/anaconda3/lib/python3.7/site-packages/pandas/io/json/normalize.py in <listcomp>(.0)
196 if record_path is None:
197 if any([[isinstance(x, dict)
--> 198 for x in compat.itervalues(y)] for y in data]):
199 # naive normalization, this is idempotent for flat records
200 # and potentially will inflate the data considerably for
/anaconda3/lib/python3.7/site-packages/pandas/compat/__init__.py in itervalues(obj, **kw)
210
211 def itervalues(obj, **kw):
--> 212 return iter(obj.values(**kw))
213
214 next = next
AttributeError: 'str' object has no attribute 'values'
Try this:
import json

df['quest'] = df['quest'].str.replace("'", '"')
dfs = []
for i in df['quest']:
    data = json.loads(i)
    dfx = pd.json_normalize(data, record_path=['Rent'], meta=[['Vacancy', 'Name'], ['Vacancy', 'Unit'], ['Vacancy', 'Value']])
    dfs.append(dfx)
df = pd.concat(dfs).reset_index(drop=True)
print(df)
Name Value Unit Vacancy.Name Vacancy.Unit Vacancy.Value
0 Asking 16.07 Usd Vacancy Pct 25.341
1 Asking 16.07 Usd Vacancy Pct 25.341
2 Asking 16.07 Usd Vacancy Pct 25.341
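If the strings can contain apostrophes, the quote replacement above may produce invalid JSON. A small alternative sketch (assuming the cells are valid Python dict literals, as in the sample, and using the rent['ques'] column from the question) is to parse them with ast.literal_eval instead:
import ast

import pandas as pd

dfs = []
for cell in rent['ques']:
    # literal_eval parses the single-quoted dict string directly, so no quote replacement is needed
    data = ast.literal_eval(cell)
    dfs.append(pd.json_normalize(
        data,
        record_path=['Rent'],
        meta=[['Vacancy', 'Name'], ['Vacancy', 'Unit'], ['Vacancy', 'Value']],
    ))

flat = pd.concat(dfs, ignore_index=True)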

How to have multiple rows under the same row index using pandas

I'm writing a script to normalise data from RT-PCR. I am reading the data from a tsv file and I'm struggling to put it into a pandas data frame so that it's usable. The issue here is that the row indexes have the same name; is it possible to make it a hierarchical structure?
I'm using Python 3.6. I've tried .groupby() and .pivot() but I can't seem to get it to do what I want.
def calculate_peaks(file_path):
    peaks_tsv = pd.read_csv(file_path, sep='\t', header=0, index_col=0)
My input file is this:
input file image
My expected output:
EMB.brep1.peak EMB.brep1.length EMB.brep2.peak EMB.brep2.length EMB.brep3.peak EMB.brep3.length
primer name
Hv161 0 19276 218.41 20947 218.39 21803 218.26
1 22906 221.35 26317 221.17 26787 221.21
Hv223 0 4100 305.24 5247 305.37 4885 305.25
1 2593 435.25 3035 435.30 2819 435.32
2 4864 597.40 5286 597.20 4965 596.60
Actual Output:
EMB.brep1.peak EMB.brep1.length EMB.brep2.peak EMB.brep2.length EMB.brep3.peak EMB.brep3.length
primer name
Hv161 19276 218.41 20947 218.39 21803 218.26
Hv161 22906 221.35 26317 221.17 26787 221.21
Hv223 4100 305.24 5247 305.37 4885 305.25
Hv223 2593 435.25 3035 435.30 2819 435.32
Hv223 4864 597.40 5286 597.20 4965 596.60
You can do this:
peaks_tsv = pd.read_csv(file_path, sep='\t', header=0)
peaks_tsv['idx'] = peaks_tsv.groupby('primer name').cumcount()
peaks_tsv.set_index(['primer name', 'idx'], inplace=True)
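The result has a (primer name, idx) MultiIndex, so repeated primers become distinct rows. A short usage sketch (assuming the primer names from the expected output above):
# All occurrences of one primer (sub-frame indexed by idx)
peaks_tsv.loc['Hv223']

# A single occurrence, selected by the (primer, idx) pair
peaks_tsv.loc[('Hv223', 2)]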

Pandas creating a consolidate report from excel

I have an Excel file with the detail below. I am trying to use pandas to get only the first 5 languages and the sum of their code counts into an Excel file.
files language blank comment code
61 Java 1031 533 3959
10 Maven 73 66 1213
12 JSON 0 0 800
32 XML 16 74 421
7 HTML 14 16 161
1 Markdown 23 0 39
1 CSS 0 0 1
Below is my code
import pandas as pd
from openpyxl import load_workbook
df = pd.read_csv("myfile_cloc.csv", nrows=20)
#df = df.iloc[1:]
top_five = df.head(5)
print(top_five)
print(top_five['language'])
print(top_five['code'].sum())
d = {'Languages (CLOC) (Top 5 Only)': "", 'LOC (CLOC)Only Code': 0}
newdf = pd.DataFrame(data=d)
newdf['Languages (CLOC) (Top 5 Only)'] = str(top_five['language'])
newdf['LOC (CLOC)Only Code'] = top_five['code'].sum()
#Load excel to append the consolidated info
writer = newdf.ExcelWriter("myfile_cloc.xlsx", engine='openpyxl')
book = load_workbook('myfile_cloc.xlsx')
writer.book = book
newdf.to_excel(writer, sheet_name='top_five', index=False)
writer.save()
I need suggestions on these lines:
newdf['Languages (CLOC) (Top 5 Only)'] = str(top_five['language'])
newdf['LOC (CLOC)Only Code'] = top_five['code'].sum()
so that the expected output can be:
Languages (CLOC) (Top 5 Only) LOC (CLOC)Only Code
Java,Maven,JSON,XML,HTML 6554
Presently I am getting this error:
raise ValueError('If using all scalar values, you must pass'
ValueError: If using all scalar values, you must pass an index
Try this.
One way to solve it is to pass an explicit index:
a = df.head()
df = pd.DataFrame({"Languages (CLOC) (Top 5 Only)": ','.join(a['language'].unique()), "LOC (CLOC)Only Code": a['code'].sum()}, index=range(1))
Another way to solve it is to use from_records and pass a list of dicts to DataFrame:
df = pd.DataFrame.from_records([{"Languages (CLOC) (Top 5 Only)": ','.join(a['language'].unique()), "LOC (CLOC)Only Code": a['code'].sum()}])
Output:
Languages (CLOC) (Top 5 Only) LOC (CLOC)Only Code
0 Java,Maven,JSON,XML,HTML 6554
import pandas as pd
sheet1 = pd.read_csv("/home/mycomputer/Desktop/practise/sorting_practise.csv")
sheet1.head()
sortby_blank = sheet1.sort_values('blank', ascending=False)
sortby_blank['blank'].head(5).sum()
values = sortby_blank['blank'].head(5).sum()
/home/mycomputer/Desktop/practise/sorting_practise.csv ---> path to your data file
blank ---> the column you want to sort by
Use .tail() if you need the bottom values instead.
The values variable will hold the answer you are looking for.

How to read binary compressed SAS files in chunks using pandas.read_sas and save as feather

I am trying to use pandas.read_sas() to read binary compressed SAS files in chunks and save each chunk as a separate feather file.
This is my code
import feather as fr
import pandas as pd
pdi = pd.read_sas("C:/data/test.sas7bdat", chunksize=100000, iterator=True)
i = 1
for pdj in pdi:
    fr.write_dataframe(pdj, 'C:/data/test' + str(i) + '.feather')
    i = i + 1
However, I get the following error:
ValueError Traceback (most recent call last)
in ()
1 i = 1
2 for pdj in pdi:
----> 3 fr.write_dataframe(pdj, 'C:/test' + str(i) + '.feather')
4 i = i + 1
5
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\feather.py in write_feather(df, dest)
116 writer = FeatherWriter(dest)
117 try:
--> 118 writer.write(df)
119 except:
120 # Try to make sure the resource is closed
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\feather.py in write(self, df)
94
95 elif inferred_type not in ['unicode', 'string']:
---> 96 raise ValueError(msg)
97
98 if not isinstance(name, six.string_types):
ValueError: cannot serialize column 0 named SOME_ID with dtype bytes
I am using Windows 7 and Python 3.6. When I inspect the data, most of the columns' cells are wrapped in b'cell_value', which I assume means the columns are stored as bytes.
I am a complete Python beginner, so I don't understand what the issue is.
Edit: looks like this was a bug patched in a recent version:
https://issues.apache.org/jira/browse/ARROW-1672
https://github.com/apache/arrow/commit/238881fae8530a1ae994eb0e283e4783d3dd2855
Are the column names strings? Are you sure pdj is of type pd.DataFrame?
Limitations
Some features of pandas are not supported in Feather:
Non-string column names
Row indexes
Object-type columns with non-homogeneous data
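For completeness, the "dtype bytes" error means the SAS strings come back as raw bytes. A minimal workaround sketch (an assumption on my part: the strings are latin-1 encoded, which is common for sas7bdat files) is to let read_sas decode them up front:
import feather as fr
import pandas as pd

# encoding= asks pandas to decode byte strings while reading,
# so the chunks contain ordinary str columns that feather can serialize
pdi = pd.read_sas("C:/data/test.sas7bdat", chunksize=100000,
                  iterator=True, encoding="latin-1")

i = 1
for pdj in pdi:
    fr.write_dataframe(pdj, 'C:/data/test' + str(i) + '.feather')
    i = i + 1
Alternatively, each chunk's bytes columns can be decoded by hand (for example with pdj[col].str.decode('latin-1')) before writing.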

Average data based on specific columns - python

I have a data file with multiple rows and 8 columns. I want to average column 8 of rows that have the same data in columns 1, 2, and 5. For example, my file can look like this:
564645 7371810 0 21642 1530 1 2 30.8007
564645 7371810 0 21642 8250 1 2 0.0103
564645 7371810 0 21643 1530 1 2 19.3619
I want to average the last column of the first and third rows, since their columns 1, 2, and 5 are identical;
I want the output to look like this:
564645 7371810 0 21642 1530 1 2 25.0813
564645 7371810 0 21642 8250 1 2 0.0103
My files (text files) are pretty big (~10000 lines), and the redundant data (based on the above rule) are not at regular intervals, so I want the code to find the redundant rows and average them...
in response to larsks comment - here are my 4 lines of code...
import os
import numpy as np
datadirectory = input('path to the data directory, ')
os.chdir( datadirectory)
##READ DATA FILE AND CREATE AN ARRAY
dataset = open(input('dataset_to_be_used, ')).readlines()
data = np.loadtxt(dataset)
##Sort the data based on common X, Y and frequency
datasort = np.lexsort((data[:,0],data[:,1],data[:,4]))
datasorted = data[datasort]
you can use pandas to do this quickly:
import pandas as pd
from StringIO import StringIO
data = StringIO("""564645 7371810 0 21642 1530 1 2 30.8007
564645 7371810 0 21642 8250 1 2 0.0103
564645 7371810 0 21643 1530 1 2 19.3619
""")
df = pd.read_csv(data, sep="\\s+", header=None)
df.groupby(["X.1","X.2","X.5"])["X.8"].mean()
the output is:
X.1 X.2 X.5
564645 7371810 1530 25.0813
8250 0.0103
Name: X.8
if you don't need index, you can call:
df.groupby(["X.1","X.2","X.5"])["X.8"].mean().reset_index()
this will give the result as:
X.1 X.2 X.5 X.8
0 564645 7371810 1530 25.0813
1 564645 7371810 8250 0.0103
Ok, based on Hury's input I updated the code -
import os  # needed system utils
import numpy as np  # for array data processing
import pandas as pd  # import the pandas module
datadirectory = input('path to the data directory, ')
working = os.environ.get("WORKING_DIRECTORY", datadirectory)
os.chdir( working)
##READ DATA FILE AND and convert it to string
dataset = open(input('dataset_to_be_used, ')).readlines()
data = ''.join(dataset)
df = pd.read_csv(data, sep="\\s+", header=None)
sorted_data = df.groupby(["X.1","X.2","X.5"])["X.8"].mean().reset_index()
tuple_data = [tuple(x) for x in sorted_data.values]
datas = np.asarray(tuple_data)
This worked with the test data as posted by hury, but when I use my own file the df = ... line does not seem to work (I get an output like this):
Traceback (most recent call last):
File "/media/DATA/arxeia/Programming/MyPys/data_refine_average.py", line 31, in
df = pd.read_csv(data, sep="\s+", header=None)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 187, in read_csv
return _read(TextParser, filepath_or_buffer, kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 141, in _read
f = com._get_handle(filepath_or_buffer, 'r', encoding=encoding)
File "/usr/lib64/python2.7/site-packages/pandas/core/common.py", line 673, in _get_handle
f = open(path, mode)
IOError: [Errno 36] File name too long: '564645\t7371810\t0\t21642\t1530\t1\t2\t30.8007\r\n564645\t7371810\t0\t21642\t8250\t1\t2\t0.0103\r\n564645\t7371810\t0\t21642\t20370\t1\t2\t0.0042\r\n564645\t7371810\t0\t21642\t33030\t1\t2\t0.0026\r\n564645\t7371810\t0\t21642\t47970\t1\t2\t0.0018\r\n564645\t7371810\t0\t21642\t63090\t1\t2\t0.0013\r\n564645\t7371810\t0\t21642\t93090\t1\t2\t0.0009\r\n564645\t7371810\t0\t216..........
any ideas?
It's not the most elegant of answers, and I have no idea how fast/efficient it is, but I believe it gets the job done based on the information you provided:
import numpy

data_file = "full_location_of_data_file"
data_dict = {}
for line in open(data_file):
    line = line.rstrip()
    columns = line.split()
    entry = [columns[0], columns[1], columns[4]]
    entry = "-".join(entry)
    try:  # valid if we have already seen this combination of columns 1, 2, 5
        data_dict[entry].append(float(columns[7]))
    except KeyError:  # KeyError the first time we see a combination of columns 1, 2, 5
        data_dict[entry] = [float(columns[7])]

for entry in data_dict:
    value = numpy.mean(data_dict[entry])
    output = entry.split("-")
    output.append(str(value))
    output = "\t".join(output)
    print output
I'm unclear whether you want/need columns 3, 6, or 7, so I omitted them. In particular, you do not make clear how you want to deal with the different values that may exist within them. If you can elaborate on what behavior you want (i.e., default to a certain value, or keep the first occurrence), I'd suggest either filling in default values or storing the first instance in a dictionary of dictionaries rather than a dictionary of lists.
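As a side note, the IOError in your update happens because the joined string of file contents is passed to read_csv, which treats its first argument as a path. A minimal fix sketch (where data is the joined string from your snippet, and the filename is a placeholder for your data file):
import pandas as pd
from io import StringIO  # on Python 2: from StringIO import StringIO

# Pass the filename directly; read_csv opens the file itself
df = pd.read_csv('1a.dat', sep=r'\s+', header=None)

# Or, if the contents are already joined into one string, wrap them in a buffer first
df = pd.read_csv(StringIO(data), sep=r'\s+', header=None)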
import os  # needed system utils
import numpy as np  # for array data processing

datadirectory = '/media/DATA/arxeia/Dimitris/Testing/12_11'
working = os.environ.get("WORKING_DIRECTORY", datadirectory)
os.chdir(working)
## Here I was trying to read the file and then use the name of the string in the following line -
## that resulted in the same error described above (errno 36 - file name too long)
data_dict = {}  # Create empty dictionary
for line in open('/media/DATA/arxeia/Dimitris/Testing/12_11/1a.dat'):  ## above error resolved when used this
    line = line.rstrip()
    columns = line.split()
    entry = [columns[0], columns[1], columns[4]]
    entry = "-".join(entry)
    try:  # valid if we have already seen this combination of columns 1, 2, 5
        data_dict[entry].append(float(columns[7]))
    except KeyError:  # KeyError the first time we see a combination of columns 1, 2, 5
        data_dict[entry] = [float(columns[7])]

for entry in data_dict:
    value = np.mean(data_dict[entry])
    output = entry.split("-")
    output.append(str(value))
    output = "\t".join(output)
    print output
My other problem now is getting the output in string format (or any format) - then I believe I can get to the save part and manipulate the final format:
np.savetxt('sorted_data.dat', sorted, fmt='%s', delimiter='\t')  # Save the data
I still have to figure out how to add the other columns - I am working on that too.
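To round this off, here is a short pandas sketch for keeping the other columns as well (an assumption on my part: columns 3, 4, 6 and 7 should keep the first value seen in each group, as the expected output above suggests; with a recent pandas, header=None yields integer column labels 0-7):
import pandas as pd

df = pd.read_csv('/media/DATA/arxeia/Dimitris/Testing/12_11/1a.dat', sep=r'\s+', header=None)

# Group on columns 1, 2 and 5 (labels 0, 1, 4), average column 8 (label 7),
# and keep the first value of the remaining columns within each group
grouped = (df.groupby([0, 1, 4], as_index=False)
             .agg({2: 'first', 3: 'first', 5: 'first', 6: 'first', 7: 'mean'}))

# Restore the original column order and save as a tab-separated file
grouped = grouped[[0, 1, 2, 3, 4, 5, 6, 7]]
grouped.to_csv('sorted_data.dat', sep='\t', header=False, index=False)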
