Saving results of df.show() - python

I need to capture the contents of a field from a table in order to append it to a filename. I have already sorted out the renaming process. Is there any way I can save the output of the following query so I can append it to the renamed file? I can't use Scala; it has to be in Python.
df = sqlContext.sql("select replace(value,'-','') from dbsmets1mig02_technical_build.tbl_Tech_admin_data where type = 'Week_Starting'")
df.show()

Have you tried using indexing?
import pandas as pd

df = pd.DataFrame({
    'Name': ['Peter', 'Peter', 'Peter', 'Peter'],
    'Planned_Start': ['1/1/2019', '1/2/2019', '1/15/2019', '1/2/2019'],
    'Duration': [2, 3, 5, 6],
    'Hrs': [0.6, 1, 1.2, 0.3]})
df.iloc[3][1]  # returns '1/2/2019'
The syntax is df.iloc[<index of row containing desired entry>][<position of your entry>].
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

You could convert the df object into a pandas DataFrame/Series object, and then use other Python commands on it more easily:
pandasdf = df.toPandas()
Once you have this as a pandas DataFrame, say it looks something like this:
import pandas as pd
pandasdf = pd.DataFrame({"col1" : ["20191122"]})
Then you can pull out the string and use an f-string to build the filename:
date = pandasdf.iloc[0, 0]
filename = f"my_file_{date}.csv"
Now the filename variable holds 'my_file_20191122.csv'.
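If all you need is that single value from the Spark DataFrame, you can also skip pandas entirely. A minimal sketch, assuming the query returns exactly one row (the renamed_file_ prefix is just a placeholder):
# first() pulls one Row back to the driver; index it by position for the value
week = df.first()[0]
filename = f"renamed_file_{week}.csv"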

Related

How do I convert and access the properties of an array of objects saved as a string in Python?

I have the following array structure, which I consume from a .csv file:
0,Done,"[{'id': '7-84-1811', 'idType': 'CIP', 'suscriptionId': '89877485'}]"
0,Done,"[{'id': '1-232-42', 'idType': 'IO', 'suscriptionId': '23532r32'}]"
0,Done,"[{'id': '2323p23', 'idType': 'LP', 'suscriptionId': 'e32e23dw'}]"
0,Done,"[{'id': 'AU23242', 'idType': 'LL', 'suscriptionId': 'dede143234'}]"
To be able to handle it with pandas, I created the respective columns, but I only need to access the "id" and "idType" properties.
My code:
from pandas.io.json import json_normalize
import pandas as pd
path = 'path_file'
df_fet = pd.read_csv(path, names=['error', 'resul', 'fields'])
df_work = df_fet[['fields'][0]['id', 'idType']]
print(df_work.head())
Error returned:
TypeError: string indices must be integers
Desired output:
id, idType
0. '7-84-1811', 'CIP'
1. '1-232-42', 'IO'
...
Here's a way to achieve the desired output:
import pandas as pd
path = 'filepath'
df = pd.read_csv(path, names=['error', 'resul', 'fields'])
df["fields"] = df["fields"].fillna("[]").apply(lambda x: eval(x))
arr = []
for row in df["fields"]:
    arr.append([row[0]["id"], row[0]["idType"]])
new = pd.DataFrame(arr, columns=["id", "idType"])
print(new)
Output:
          id idType
0  7-84-1811    CIP
1   1-232-42     IO
2    2323p23     LP
3    AU23242     LL
With the eval() function, Python interprets the argument as a Python expression, so the string is interpreted as a list. Note that eval() executes arbitrary code, so only use it on trusted input.
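A safer variant of the same idea, sketched with ast.literal_eval from the standard library, which parses Python literals without executing arbitrary code:
import ast
import pandas as pd

df = pd.read_csv('filepath', names=['error', 'resul', 'fields'])
# literal_eval turns each string into a list of dicts, refusing anything but literals
df["fields"] = df["fields"].fillna("[]").apply(ast.literal_eval)
new = pd.DataFrame([[row[0]["id"], row[0]["idType"]] for row in df["fields"]],
                   columns=["id", "idType"])
print(new)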

converting a whole table/dataframe in pyarrow to dictionary-encoded columns

I am loading a parquet file with Apache Arrow (pyarrow), and so far I necessarily need to transfer it to pandas, do the conversion to categorical, and send it back as an Arrow table (to save it later as a feather file).
The code looks like this:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.feather as ft

df = pq.read_table(inputFile)
# convert to pandas
df2 = df.to_pandas()
# get all cols that needs to be transformed and cast
list_str_obj_cols = df2.columns[df2.dtypes == "object"].tolist()
for str_obj_col in list_str_obj_cols:
    df2[str_obj_col] = df2[str_obj_col].astype("category")
print(df2.dtypes)
#get back from pandas to arrow
table = pa.Table.from_pandas(df2)
# write the file in fs
ft.write_feather(table, outputFile, compression='lz4')
Is there any way to do this directly with pyarrow? Would it be faster?
Thanks in advance.
In pyarrow, "categorical" is referred to as "dictionary encoded". So I think your question is whether it is possible to dictionary-encode the columns of an existing table. You can use the pyarrow.compute.dictionary_encode function to do this. Putting it all together:
import pyarrow as pa
import pyarrow.compute as pc
def dict_encode_all_str_columns(table):
    new_arrays = []
    for index, field in enumerate(table.schema):
        if field.type == pa.string():
            new_arr = pc.dictionary_encode(table.column(index))
            new_arrays.append(new_arr)
        else:
            new_arrays.append(table.column(index))
    return pa.Table.from_arrays(new_arrays, names=table.column_names)
table = pa.Table.from_pydict({'int': [1, 2, 3, 4], 'str': ['x', 'y', 'x', 'y']})
print(table)
print(dict_encode_all_str_columns(table))
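As a possible alternative, assuming your pyarrow version supports casting string columns to dictionary types (recent versions do; otherwise use dictionary_encode as above), the same transformation can be sketched with Table.cast and a rewritten schema:
import pyarrow as pa

def dict_encode_via_cast(table):
    # Replace every string field in the schema with a dictionary field.
    fields = [f.with_type(pa.dictionary(pa.int32(), pa.string()))
              if f.type == pa.string() else f
              for f in table.schema]
    return table.cast(pa.schema(fields))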

Why does setting key='table' in pd.DataFrame.to_hdf() create an extra empty key in the resulting hdf?

When writing a pandas DataFrame to hdf, if key is set to 'table' then the resulting hdf contains an empty key '/'. Other string values I have tried do not do this, and it seems strange that behaviour would depend on the name of a key. Why does this happen?
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.to_hdf('hdf1', key='a_key_that_is_not_table')
>>> df.to_hdf('hdf2', key='table')
>>> store1 = pd.HDFStore('hdf1')
>>> store2 = pd.HDFStore('hdf2')
>>> store1.keys()
['/a_key_that_is_not_table']
>>> store2.keys()
['/', '/table']
Updated example script:
#!/usr/bin/python3
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
keys = ['a_key_that_is_not_table', 'table']
for idx, key in enumerate(keys):
    filename = f'df{idx}.h5'
    df.to_hdf(filename, key=key, mode='w', format='table')
    store = pd.HDFStore(filename)
    print(f'Loop {idx}, key = {key}, store.keys() ={store.keys()}')
    store.close()
Output:
Loop 0, key = a_key_that_is_not_table, store.keys() =['/a_key_that_is_not_table']
Loop 1, key = table, store.keys() =['/', '/table']
Every HDF5 file has a "root group" referenced as "/". If you inspect both files with HDFView, you will find each has one group (named '/a_key_that_is_not_table' in file df0.h5 and '/table' in file df1.h5), so it's not an error from the HDF5 schema standpoint.
Looking deeper into the files, I suspect the issue comes from the pandas abstraction layer on top of PyTables. Both files have the same schema. Under each named key (HDF5 group) there is a group named '_i_table', which has a subgroup named 'index', and a dataset named 'table'. Likely 'table' is a reserved name, and using it as a key trips up pandas' key-name logic. Changing 'table' to 'Table' eliminates the '/' in the output for df1.h5.
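For reference, a minimal sketch of that kind of inspection with PyTables itself (the library pandas uses under the hood), assuming the df1.h5 file from the script above:
import tables

# Walk every node in the file and print its HDF5 path.
with tables.open_file('df1.h5') as h5:
    for node in h5.walk_nodes('/'):
        print(node._v_pathname)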
You need to specify the format of your file. Also, I think you need to pass the "w" (write) mode to the function, because the default mode is "a" (append). For example:
df.to_hdf('data.h5', key='df', mode='w')
If you need to see more:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_hdf.html

save dataframe containing lists as csv file

I have created a DataFrame as follows:
df = pd.DataFrame({'name': imgname, 'pose': pose})
where imgname is a list of strings such as ['image1', 'image2', ...]
and pose is a list of NumPy arrays such as pose = [array([ 55.77614093, 8.45208199, 2.69841043, 2.17110961]),
array([ 66.61236215, 5.87653161, -31.70704038, -21.68979529])]
I use this line to write the DataFrame to a csv file:
df.to_csv('dataset.csv',index = 'False')
However, the pose column is converted to a string with '\n' and extra spaces.
How can I save the values as numbers in csv format?
For pose, you probably do not want a list of NumPy arrays but a plain list of lists.
Your code would work if you remove the array part:
import pandas as pd
imgname = ['image1','image2']
pose = [[55.77614093, 8.45208199, 2.69841043, 2.17110961], [66.61236215, 5.87653161, -31.70704038, -21.68979529]]
df = pd.DataFrame({'name': imgname, 'pose': pose})
df output:
     name                                               pose
0  image1  [55.77614093, 8.45208199, 2.69841043, 2.17110961]
1  image2  [66.61236215, 5.87653161, -31.70704038, -21.68...
After that,
df.to_csv('dataset.csv', index=False)
works just fine. (Note that index should be the boolean False, not the string 'False'; a non-empty string is truthy, so the index would still be written.)
import numpy as np
import pandas as pd

df = pd.DataFrame({'Data': [np.array([1, 2, 3, 4]), np.array([5, 6, 7, 8])], 'IMAGE': ['IMAGE1', 'IMAGE2']})
print(df)
df.to_csv('dataset.csv', index=False)
and the resulting dataset.csv looks like this:
Data,IMAGE
[1 2 3 4],IMAGE1
[5 6 7 8],IMAGE2
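Going back to the asker's df with name and pose columns: if the goal is to store the pose values as plain numbers, another option is to expand each 4-element array into its own numeric column before saving. A sketch with hypothetical column names x, y, z, w:
import pandas as pd

# Spread each pose across four numeric columns, one value per cell.
pose_cols = pd.DataFrame(df['pose'].tolist(), columns=['x', 'y', 'z', 'w'])
out = pd.concat([df[['name']], pose_cols], axis=1)
out.to_csv('dataset.csv', index=False)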

How to read this JSON file in Python?

I'm trying to read a JSON file like the following in Python, to save only two of the values from each response entry:
{
"responseHeader":{
"status":0,
"time":2,
"params":{
"q":"query",
"rows":"2",
"wt":"json"}},
"response":{"results":2,"start":0,"docs":[
{
"name":["Peter"],
"country":["England"],
"age":["23"]},
{
"name":["Harry"],
"country":["Wales"],
"age":["30"]}]
}}
For example, I want to put the name and the age in a table. I already tried it this way (based on this topic), but it's not working for me.
import json
import pandas as pd
file = open("myfile.json")
data = json.loads(file)
columns = [dct['name', 'age'] for dct in data['response']]
df = pd.DataFrame(data['response'], columns=columns)
print(df)
I have also seen other solutions for reading a JSON file, but they all dealt with JSON files that had no other header values at the top, like responseHeader in this case. I don't know how to handle that. Can anyone help me out?
import json

with open("myfile.json") as f:
    columns = [(dic["name"], dic["age"]) for dic in json.load(f)["response"]["docs"]]
print(columns)
result:
[(['Peter'], ['23']), (['Harry'], ['30'])]
You can pass the list data["response"]["docs"] to pandas directly, as it's a record set.
df = pd.DataFrame(data["response"]["docs"])
print(df)
      name    country   age
0  [Peter]  [England]  [23]
1  [Harry]    [Wales]  [30]
The data in your DataFrame will be bracketed, though, as you can see. If you want to remove the brackets, you can consider the following:
for column in df.columns:
    df.loc[:, column] = df.loc[:, column].str.get(0)
    if column == 'age':
        df.loc[:, column] = df.loc[:, column].astype(int)
sample = {"responseHeader":{
"status":0,
"time":2,
"params":{
"q":"query",
"rows":"2",
"wt":"json"}},
"response":{"results":2,"start":0,"docs":[
{
"name":["Peter"],
"country":["England"],
"age":["23"]},
{
"name":["Harry"],
"country":["Wales"],
"age":["30"]}]
}}
data = [(x['name'][0], x['age'][0]) for x in sample['response']['docs']]
df = pd.DataFrame(data, columns=['name', 'age'])
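For what it's worth, pandas.json_normalize (pandas >= 1.0) gets you the same record set in one call; a sketch on the sample above, unwrapping the one-element lists with .str.get(0):
import pandas as pd

# json_normalize flattens the list of dicts into a DataFrame.
df = pd.json_normalize(sample['response']['docs'])
# Each cell is a one-element list, so unwrap and keep only name/age.
df = df[['name', 'age']].apply(lambda col: col.str.get(0))
print(df)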
