I am applying a method from a custom Python library, mlfinlab.
When I apply this calculation (here is the link) under get_ema_dollar_imbalance_bars, I get back an object of type <class 'tuple'>, and towards the end of the output I see the following:
[4798 rows x 10 columns],
Empty DataFrame
Columns: []
Index: [] , dtype=object)
What I am trying to do is convert it into a pandas DataFrame.
I have tried different approaches, including transforming it first to a NumPy array so I can then get it into pandas DataFrame format, but it is still not working.
Any help or guidance is appreciated.
I have included some snapshots as well.
Best regards
First option:
Convert your tuple to a list with:
lst = list(yourtuple)
Convert to a dataframe:
df = pd.DataFrame(lst)
Second option:
I can't see the structure of your tuple, but assuming you need more than one column, and your tuple is structured like ((col1, (entry1, entry2)), (col2, (entry1, entry2))), you can do the following:
Create a dictionary:
newdict = dict(yourtuple)
Convert to a dataframe:
df = pd.DataFrame(newdict)
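That said, judging from the output you pasted, the first element of your tuple already appears to be the bars DataFrame (followed by an empty DataFrame), so simple indexing may be all you need:
result = ...   # the tuple returned by get_ema_dollar_imbalance_bars
df = result[0]   # the [4798 rows x 10 columns] DataFrame shown in your output
print(type(df))  # <class 'pandas.core.frame.DataFrame'>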
I have two columns in my DataFrame which I format to a specific time format. It works with the two lines of code below, but I want to combine them into one command:
df['Time01'] = pd.to_datetime(Time_01).strftime('%H:%M:%S')
df['Time02'] = pd.to_datetime(Time_02).strftime('%H:%M:%S')
I have tried the following:
df[['Time_01','Time_02']].apply(pd.to_datetime, format = '%H:%M:%S')
But I get the following error message:
None of [Index(['Time_01', 'Time_02'], dtype='object')] are in the
[columns]
I'm new to Python and pandas; any help is appreciated.
Your proposed solution doesn't work because, as the error says, there are no columns "Time_01" and "Time_02" in df yet; the Time_01 and Time_02 objects that you convert to pandas datetimes are independent of df. One way to write the first two lines as a single statement is to build the columns in a dict comprehension and pass the resulting dictionary to the assign method:
df = df.assign(**{f'Time0{i+1}': pd.to_datetime(lst).strftime('%H:%M:%S')
for i, lst in enumerate((Time_01, Time_02))})
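For reference, here is a self-contained sketch with made-up times (the Time_01 and Time_02 lists below are hypothetical placeholders for your actual data):
import pandas as pd

Time_01 = ['2021-01-01 09:30:15', '2021-01-01 10:45:00']  # hypothetical data
Time_02 = ['2021-01-01 16:00:00', '2021-01-01 16:15:30']
df = pd.DataFrame(index=range(2))

df = df.assign(**{f'Time0{i+1}': pd.to_datetime(lst).strftime('%H:%M:%S')
                  for i, lst in enumerate((Time_01, Time_02))})
print(df)  # columns Time01 and Time02 holding '09:30:15', '16:00:00', etc.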
I have implemented the below piece of code:
array = [table.iloc[:, [0]], table.iloc[:, [i]]]
It is supposed to be a dataframe consisting of two vectors extracted from a previously imported dataset. I use the parameter i because this code is part of a loop that uses a predefined function to analyze correlations between one fixed variable [0] and the rest of them; each iteration checks the correlation with a different variable [i].
Python treats this object as a list, or as a tuple when I change the brackets to round ones. I need this object to be a dataframe (the next step is to remove NaN values using .dropna, which is a DataFrame attribute).
How can I fix that issue?
If I have correctly understood your question, you want to build an extract from a larger dataframe containing only 2 columns known by their index number. You can simply do:
sub = table.iloc[:, [0, i]]
It will keep all attributes (including index, column names and dtype) from the original table dataframe.
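A minimal sketch of how that fits into your loop; the correlation call here is just a placeholder for your predefined analysis function:
for i in range(1, table.shape[1]):
    sub = table.iloc[:, [0, i]].dropna()  # a real DataFrame, so .dropna() works
    print(sub.columns[1], sub.iloc[:, 0].corr(sub.iloc[:, 1]))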
What is your goal with the dataframe?
A dataframe is a common term in data analysis using pandas.
Pandas was developed precisely to facilitate such analysis; with it, getting the data from a .csv file and transforming it into a dataframe is as simple as:
import pandas as pd
df = pd.read_csv('my-data.csv')
df.info()
Or from a dict or array:
df = pd.DataFrame(my_dict_or_array)
Then you can select the columns you want:
df.loc[:, ['COLUMN_1', 'COLUMN_2']]
Let us know if this is what you are looking for.
I have a list of Pandas Series objects, which I've created by doing something like this:
li = []
li.append(input_df.iloc[0])
li.append(input_df.iloc[4])
where input_df is a Pandas DataFrame.
I want to convert this list of Series objects back to a Pandas DataFrame, and was wondering if there is an easy way to do it.
Based on the post, you can do this with:
pd.DataFrame(li)
To everyone suggesting pd.concat: this is not a Series anymore. They are appending values to a list, and the data type of li is list. So to convert the list to a dataframe, they should use pd.DataFrame(<list name>).
Since the right answer got hidden in the comments, I thought it would be better to mention it as an answer:
pd.concat(li, axis=1).T
will convert the list li of Series to a DataFrame.
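For reference, both suggestions produce the same rows on a small, made-up input_df:
import pandas as pd

input_df = pd.DataFrame({'a': range(5), 'b': range(5)})  # hypothetical data
li = [input_df.iloc[0], input_df.iloc[4]]

print(pd.DataFrame(li))          # each Series becomes a row
print(pd.concat(li, axis=1).T)   # the same result via concat plus transpose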
It seems that you wish to perform a customized melting of your dataframe.
Using the pandas library, you can do it with one line of code. Below, I create an example to replicate your problem:
import pandas as pd
input_df = pd.DataFrame(data={'1': [1, 2, 3, 4, 5],
                              '2': [1, 2, 3, 4, 5],
                              '3': [1, 2, 3, 4, 5],
                              '4': [1, 2, 3, 4, 5],
                              '5': [1, 2, 3, 4, 5]})
Using pd.DataFrame, you can create a new dataframe that combines your two selected rows:
li = []
li.append(input_df.iloc[0])
li.append(input_df.iloc[4])
new_df = pd.DataFrame(li)
If what you want is for those two rows to end up under one column, I would not pass them as a list back to the dataframe constructor.
Instead, you can just append one Series to the other, disregarding the original column names:
new_df = input_df.iloc[0].append(input_df.iloc[4])
(Note that Series.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat([input_df.iloc[0], input_df.iloc[4]]) is the modern equivalent.)
Let me know if this answers your question.
The answer was already mentioned, but I would like to share my version:
li_df = pd.DataFrame(li).T
If you want each Series to be a row of the dataframe, you should not use concat() followed by .T, unless all your values are of the same datatype.
If your data has both numerical and string values, this route will mangle the dtypes, likely turning them all into object.
The right way to do this in general is:
Convert each series to a dict()
Pass the list of dicts into the pd.DataFrame() constructor directly, or build a dict of dicts and use pd.DataFrame.from_dict with the orient keyword set to "index".
In your case the following should work:
my_list_of_dicts = [s.to_dict() for s in li]
my_df = pd.DataFrame(my_list_of_dicts)
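As a quick illustration of the dtype point, with a hypothetical mixed-dtype frame:
import pandas as pd

input_df = pd.DataFrame({'name': ['x', 'y'], 'value': [1, 2]})  # hypothetical data
li = [input_df.iloc[0], input_df.iloc[1]]

print(pd.concat(li, axis=1).T.dtypes)                   # both columns come back as object
print(pd.DataFrame([s.to_dict() for s in li]).dtypes)   # 'value' is restored to int64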
I have a csv with over 4000 rows, in which each cell contains a tuple holding a specific coordinate. I would like to convert it to a numpy array to work with. I use pandas to convert it into a DataFrame before calling df.values. However, after calling df.values, each tuple becomes a string "(x,y)" instead. Is it possible to prevent this from happening? Thank you.
df = pd.read_csv(sample_data)
array = df.values
I think the problem is that reading from csv always gives you the tuples as strings.
So you need to convert them:
import ast
df['col'] = df['col'].apply(ast.literal_eval)
Or if all columns are tuples:
df = df.applymap(ast.literal_eval)
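If you prefer, the parsing can also happen at read time via read_csv's converters argument (the column name 'col' is assumed here; adjust it to your file):
import ast
import pandas as pd

df = pd.read_csv(sample_data, converters={'col': ast.literal_eval})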
It seems that you read the file from a local path?
My answer is to use eval to convert the strings:
df.apply(lambda x: x.apply(eval))
Another way to change the data type after reading the csv (note that this only works if the column already holds sequences such as lists, since applying tuple to a string would split it into individual characters):
df['col'].apply(tuple)
I am using Python 2.7 with dask.
I have a dataframe with one column of tuples that I created like this:
table[col] = table.apply(lambda x: (x[col1], x[col2]), axis=1, meta=pd.DataFrame)
I want to convert this tuple column back into two separate columns.
In pandas I would do it like this:
table[[col1,col2]] = table[col].apply(pd.Series)
The point of doing so is that a dask dataframe does not support a multi index, and I want to group by multiple columns, so I wish to create a column of tuples that gives me a single index containing all the values I need (please ignore efficiency vs. a multi index, as there is not yet full support for this in dask dataframes).
When I try to unpack the tuple column with dask using this code:
rxTable[["a","b"]] = rxTable["tup"].apply(lambda x: s(x), meta = pd.DataFrame, axis = 1)
I get this error
AttributeError: 'Series' object has no attribute 'columns'
When I try:
rxTable[["a","b"]] = rxTable["tup"].apply(dd.Series, axis = 1, meta = pd.DataFrame)
I get the same error.
How can I take a column of tuples and convert it into two columns, like I can in pandas with no problem?
Thanks
The best I have found so far is converting into a pandas dataframe, converting the column there, then going back to dask:
df1 = df.compute()
df1[["a","b"]] = df1["c"].apply(pd.Series)
df = dd.from_pandas(df1,npartitions=1)
This will work well. If the df is too big for memory, you can either:
1. Compute only the wanted column, convert it into two columns, and then use merge to get the split results into the original df (see the sketch below), or
2. Split the df into chunks, convert each chunk, and append it to an HDF5 file, then use dask to read the entire HDF5 file into the dask dataframe.
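A rough sketch of option 1, assuming the tuple column is named 'c' and the two target columns 'a' and 'b' (hypothetical names):
import dask.dataframe as dd
import pandas as pd

col = df['c'].compute()        # bring only the tuple column into memory
split = col.apply(pd.Series)   # unpack into two pandas columns
split.columns = ['a', 'b']
df = dd.merge(df, dd.from_pandas(split, npartitions=1),
              left_index=True, right_index=True)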
I found this methodology works well and avoids converting the Dask DataFrame to Pandas:
df['a'] = df['tup'].str.partition(sep)[0]
df['b'] = df['tup'].str.partition(sep)[2]
where sep is whatever delimiter you were using in the column to separate the two elements.
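Alternatively, if the column holds actual tuples rather than strings, an element-wise apply with an explicit meta stays lazy and avoids string parsing entirely (the column names 'tup', 'a', and 'b' are taken from the question):
df['a'] = df['tup'].apply(lambda t: t[0], meta=('a', 'object'))
df['b'] = df['tup'].apply(lambda t: t[1], meta=('b', 'object'))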