Read and write a dataset description as a pandas DataFrame "attribute"? - python

The R-help for feather_metadata states "Returns the dimensions, field names, and types; and optional dataset description."
@hrbrmstr kindly posted a PR to answer this SO question and make it possible to add a dataset description to a feather file from R.
I'd like to know whether it is also possible to read (and write) such a dataset description in Python / pandas using feather.read_dataframe and feather.write_dataframe. I searched the documentation but couldn't find any information about this. I was hoping something like the following might work:
import feather
import pandas as pd
dat = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
dat._metadata = "A dataset description ..."
feather.write_dataframe(dat, "pydf.feather")
Or else perhaps:
feather.write_dataframe(dat, "pydf.feather", "A dataset description ...")

Related

Get column value of row with condition based on another column

I keep writing code like this when I want the value of a specific column in a row that I first select based on another column.
my_col = "Value I am looking for"
df.loc[df["Primary Key"] == "blablablublablablalblalblallaaaabblablalblabla"].iloc[0][
my_col
]
I don't know why, but it seems weird. Is there a more beautiful solution to this?
A complete minimal working example would help, since it is not clear what your data structure looks like. You could use the example given here:
import pandas as pd
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])
If you are then trying to e.g. select the viper-row based on its max_speed, and then obtain its shield-value like so:
my_col = "shield"
df.loc[df["max_speed"] == 4].iloc[0][my_col]
then I guess that is the way to do that - not a lot of fat in that command.
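For what it's worth, the same lookup can be written a little more compactly by passing the row condition and the column label to a single .loc call. This is only a sketch on the toy frame above. The second variant assumes the key column holds unique values, which may not be true for your data.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])

my_col = "shield"

# Boolean row mask and column label in one .loc call, then take the first match.
value = df.loc[df["max_speed"] == 4, my_col].iloc[0]

# Assumption: max_speed is unique, so it can serve as the index for a scalar lookup.
value_via_index = df.set_index("max_speed").at[4, my_col]

print(value, value_via_index)
```

Both variants avoid building an intermediate row before picking the column.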

Pandas convert columns to rows

I am a beginner in Data Science and I am trying to pivot this data frame using Pandas:
So it becomes something like this: (The labels should become the column and file paths the rows.)
I tried this code which gave me an error:
EDIT:
I have tried Marcel's suggestion, the output it gave is this:
The "label" column is a group or class of file paths. I want to convert it so that it fits tf.keras.preprocessing.image.ImageDataGenerator.flow_from_dataframe in categorical mode.
Thanks in advance to all for helping me out.
I did not understand your question very well, but if you just want to convert columns to rows then you can do
train_df.T
which transposes the DataFrame.
I think you are looking for something like this:
import pandas as pd
df = pd.DataFrame({
    'labels': ['a', 'a', 'a', 'b', 'b'],
    'pathes': [1, 2, 3, 4, 5]
})
labels = df['labels'].unique()
new_cols = []
for label in labels:
    # keep only the paths of this label and renumber them from 0
    new_cols.append(df['pathes'].where(df['labels'] == label).dropna().reset_index(drop=True))
# put the per-label Series side by side, one column per label
df_final = pd.concat(new_cols, axis=1)
print(df_final)
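A possible variation on the same idea without the explicit loop: number the rows within each label with groupby().cumcount() and then pivot. This is a sketch on the same toy data. The helper column name row is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'labels': ['a', 'a', 'a', 'b', 'b'],
    'pathes': [1, 2, 3, 4, 5]
})

# Position of each row within its label group: 0, 1, 2 for 'a' and 0, 1 for 'b'.
numbered = df.assign(row=df.groupby('labels').cumcount())

# One column per label; shorter groups are padded with NaN.
wide = numbered.pivot(index='row', columns='labels', values='pathes')
print(wide)
```

Unlike the loop version, the resulting columns are named after the labels rather than all being called pathes.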
I've found what was wrong: I misunderstood y_col and x_col in tf.keras.preprocessing.image.ImageDataGenerator.flow_from_dataframe. Thanks to all of you for your contributions. Your answers are all correct in different ways. Thanks again Marcel h and user16714199!

How can I find out different languages in a pandas dataframe other than English

I have a pandas dataframe. The TEXT column contains a mix of different languages. The data volume is large (more than 40k records), so I get a timeout exception when applying Google Translate to all of it. I'm therefore trying to identify the rows that are not in English and apply the translator only to those rows.
For reference, I've created the sample dataframe below and applied language detection to it.
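import pandas as pd
# Assumption: detect comes from the langdetect package; the original snippet did not show this import.
from langdetect import detect
mydata = {'ID': [1, 2, 3, 4, 5],
          'TEXT': ['How are you', '你好吗 hope all good', '元気ですか', 'cómo estás' , 'Hope all good 元気ですか']}
df = pd.DataFrame(mydata)
df['Language'] = df['TEXT'].apply(detect)
print(df)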
My output looks like :
I expected the Language column to contain "cn" for ID #2 and "ja" for ID #5, but any text that mixes another language with English is detected as "en".
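One rough way to pre-filter the rows, sketched here under a strong assumption: a row is treated as "possibly not pure English" if it contains at least one non-ASCII character. This is a heuristic, not language detection. It would miss non-English text written in plain ASCII and would flag English text containing accented characters. The column name has_non_ascii is made up for illustration.

```python
import pandas as pd

mydata = {'ID': [1, 2, 3, 4, 5],
          'TEXT': ['How are you', '你好吗 hope all good', '元気ですか',
                   'cómo estás', 'Hope all good 元気ですか']}
df = pd.DataFrame(mydata)

# True where the text contains at least one character outside the ASCII range.
df['has_non_ascii'] = df['TEXT'].map(lambda s: not s.isascii())

# Only these rows would be sent on to detection / translation.
to_translate = df[df['has_non_ascii']]
print(to_translate)
```

On the sample data this keeps the mixed rows (IDs 2 and 5) in the set to translate, which a whole-string detector may label as English.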

Strange layout of the HDF tables from pandas.HDFStore

When I output a pandas.DataFrame as a table in HDFStore:
import pandas as pd
df=pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=range(2))
with pd.HDFStore("test.hdf5") as store:
    store.put("test", df, format="table")
I get the following layout when reading in ViTables:
I can read it back correctly with pandas.read_hdf(), but I find the data difficult to read in the viewer: it is stored in blocks, and the column names are hidden behind a values_block_0 label.
Is there a way to have a more intuitive layout in the HDF?
Adding data_columns=True to the store.put() arguments gives a better layout, with each column stored under its own name.
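As a minimal sketch of that suggestion (the keyword is spelled data_columns in pandas, and writing HDF5 needs the PyTables package installed), the write could be wrapped like this. The helper name put_with_data_columns is hypothetical, not part of pandas.

```python
import pandas as pd


def put_with_data_columns(df, path, key="test"):
    """Write df as a table-format HDF5 node with every column as a data column."""
    with pd.HDFStore(path, mode="w") as store:
        # data_columns=True asks pandas to store each column separately
        # instead of grouping same-dtype columns into values_block_N arrays.
        store.put(key, df, format="table", data_columns=True)
```

Calling put_with_data_columns(df, "test.hdf5") with the frame from the question should then show A and B as separate fields in the viewer, and pandas.read_hdf("test.hdf5", "test") reads the frame back as before.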

Is there a way to allow NaN values to be written to CSV from pandas?

I get the following error: builtins.AssertionError: 12 columns passed, passed data had 6 columns. The data in the last 6 columns varies from row to row, so I'm happy to have None wherever data is missing. However, I can't find a simple way to do this. I'm pretty sure there must be an option for it, but I can't see it in the docs or in any Google search.
Any help would be appreciated. To reiterate: I know what is causing the problem and I know that data is missing from some columns. I would like to ignore the missing data and am happy to have None or NaN in the output CSV.
I imagine you have fixed headers, so my solution would be to pad each row to the full number of columns:
import pandas as pd
import numpy as np
columns = ('Person', 'Title', 'AnotherPerson', 'AnotherPerson2', 'AnotherPerson3', 'AnotherPerson4', 'Date', 'Group')
mandatory = len(columns)
data = [[1,2,3], [1, 2], [1, 2, 3, 4]]
data = list(map(lambda x: dict(enumerate(x)), data))
data = [[item.get(i, np.nan) for i in range(mandatory)] for item in data]
df = pd.DataFrame(data=data, columns=columns)
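A shorter variation, sketched on the same made-up rows: the DataFrame constructor already pads ragged lists with NaN up to the longest row, so it may be enough to build the frame first, widen it with reindex, and then attach the headers. The na_rep argument of to_csv controls how missing values are written. The variable name csv_text is just for illustration.

```python
import pandas as pd

columns = ['Person', 'Title', 'AnotherPerson', 'AnotherPerson2',
           'AnotherPerson3', 'AnotherPerson4', 'Date', 'Group']
data = [[1, 2, 3], [1, 2], [1, 2, 3, 4]]

# Ragged rows are padded with NaN up to the longest row (4 columns here);
# reindex then adds the remaining all-NaN columns up to len(columns).
df = pd.DataFrame(data).reindex(columns=range(len(columns)))
df.columns = columns

# Write missing values explicitly as "NaN" instead of empty fields.
csv_text = df.to_csv(index=False, na_rep="NaN")
print(csv_text)
```

Passing a file path as the first argument to to_csv writes the same text to disk instead of returning it.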
