python pandas quantlib.time.date.Date

I have two dataframes:
import pandas as pd
from quantlib.time.date import Date
cols = ['ColStr','ColDate']
dataset1 = [['A',Date(2017,1,1)],['B',Date(2017,2,2)]]
x = pd.DataFrame(dataset1,columns=cols)
dataset2 = [['A','2017-01-01'],['B','2017-02-04']]
y = pd.DataFrame(dataset2,columns=cols)
Now, I want to compare the two tables. I have written another set of code that compares the two (larger) dataframes, and it works for string and numerical values.
My problem is that with column 'ColDate', one side being string type and the other being Date type, I am not able to validate that 'ColStr' = 'A' is a match and 'ColStr' = 'B' is a mismatch.
I would have to
(1) either convert y.ColDate to Date
(2) or convert x.ColDate to str with a similar format as y.ColDate.
How do I achieve one or the other?

I guess that you need to cast them to a single common type using something like x['ColDate'] = x.ColDate.map(convert_type), or any other method of iterating over column values. Check other functions from the pandas docs, like apply().
The convert_type function should be defined in your program and accept a single argument so it can be passed into map().
Once the columns have the same type, you can compare them using any method you like.
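For example, a minimal sketch of option (1), assuming y.ColDate always holds 'YYYY-MM-DD' strings; it reuses the Date(year, month, day) constructor from your question:
from quantlib.time.date import Date

def convert_type(s):
    # parse an ISO 'YYYY-MM-DD' string into a QuantLib Date
    year, month, day = (int(part) for part in s.split('-'))
    return Date(year, month, day)

y['ColDate'] = y['ColDate'].map(convert_type)
matches = x['ColDate'] == y['ColDate']  # element-wise comparison of Date objects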

You probably want to use the dt.strftime() function. Note that the .dt accessor only works on a datetime64 column, not on an object column of QuantLib dates, so x.ColDate would first have to be converted to pandas datetimes:
x['ColDate'].dt.strftime("%Y-%m-%d")
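A sketch of option (2) goes straight from the QuantLib objects to strings; it assumes the pyQL Date exposes year, month and day attributes, so adjust to whatever accessors your binding actually provides:
x['ColDate'] = x['ColDate'].map(lambda d: '%04d-%02d-%02d' % (d.year, d.month, d.day))
matches = x['ColDate'] == y['ColDate']  # both columns are now 'YYYY-MM-DD' strings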

Filtering Pandas Dataframe Based on List of Column Names

I have a pandas dataframe which has maybe 1000 columns. However, I do not need so many columns; I only need columns whose names match, start with, or contain specific strings.
So let's say I have dataframe columns like
df.columns =
HYTY, ABNH, CDKL, GHY#UIKI, BYUJI##hy, BYUJI#tt, BBNNII#5, FGATAY#J ....
I want to select only the columns whose names are like HYTY, CDKL, BYUJI* & BBNNI*
So what I was trying to do is to create a list of regular expressions like:
import re
relst = ['HYTY', 'CDKL*', 'BYUJI*', 'BBNI*']
my_w_lst = [re.escape(s) for s in relst]
mask_pattrn = '|'.join(my_w_lst)
Then I create the logical vector of TRUE/FALSE values saying whether each name matches. However, I don't understand how to get a dataframe containing only the columns selected as True.
Any help will be appreciated.
Using what you already have, you can pass your mask to filter like:
df.filter(regex=mask_pattrn)
Use re.findall(). It will give you a list of columns to pass to df[mylist].
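A sketch of that suggestion, reusing mask_pattrn from the question (mylist is a hypothetical name):
mylist = [col for col in df.columns if re.findall(mask_pattrn, col)]  # keep matching names
subdf = df[mylist]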
We can do it with startswith:
relst = ['CDKL', 'BYUJI', 'BBNI']
subdf = df.loc[:,df.columns.str.startswith(tuple(relst))|df.columns.isin(['HYTY'])]

How to make an object into a dataframe in python

I have implemented the below part of code :
array = [table.iloc[:, [0]], table.iloc[:, [i]]]
It is supposed to be a dataframe consisting of two vectors extracted from a previously imported dataset. I use the parameter i because this code is part of a loop which uses a predefined function to analyze correlations between one fixed variable [0] and the rest of them; each iteration checks the correlation with a different variable [i].
Python treats this object as a list, or as a tuple when I change the brackets to round ones. I need this object to be a dataframe (the next step is to remove NaN values using .dropna, which is a df attribute).
How can I fix that issue?
If I have correctly understood your question, you want to build an extract from a larger dataframe containing only 2 columns known by their index number. You can simply do:
sub = table.iloc[:, [0,i]]
It will keep all attributes (including index, column names and dtype) from the original table dataframe.
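For context, a hedged sketch of how that line could sit inside the loop described in the question (the loop body itself is an assumption):
for i in range(1, table.shape[1]):
    sub = table.iloc[:, [0, i]].dropna()          # two-column dataframe, NaNs removed
    print(sub.columns[1], sub.corr().iloc[0, 1])  # correlation of column 0 with column i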
What is your goal with the dataframe?
A dataframe is a common term in data analysis using pandas.
Pandas was developed just to facilitate such analysis; with it, getting the data from a .csv file and turning it into a dataframe is as simple as:
import pandas as pd
df = pd.read_csv('my-data.csv')
df.info()
Or from a dict or array:
df = pd.DataFrame(my_dict_or_array)
Then you can select the columns you wish:
df.loc[:, ['COL_1', 'COL_2']]
Let us know if this is what you are looking for.

List of Series to Dataframe

I have a list having Pandas Series objects, which I've created by doing something like this:
li = []
li.append(input_df.iloc[0])
li.append(input_df.iloc[4])
where input_df is a pandas DataFrame.
I want to convert this list of Series objects back to a pandas DataFrame, and was wondering if there is some easy way to do it.
Based on the post, you can do this by:
pd.DataFrame(li)
To everyone suggesting pd.concat: this is not a Series anymore. They are adding values to a list, and the data type of li is list. So to convert the list to a dataframe, they should use pd.DataFrame(<list name>).
Since the right answer has gotten hidden in the comments, I thought it would be better to mention it as an answer:
pd.concat(li, axis=1).T
will convert the list li of Series to a DataFrame.
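A quick check against the setup from the question (input_df and li as defined there):
li = [input_df.iloc[0], input_df.iloc[4]]
out = pd.concat(li, axis=1).T  # each Series becomes one row of the result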
It seems that you wish to perform a customized melting of your dataframe.
Using the pandas library, you can do it with one line of code. Below I create an example to replicate your problem:
import pandas as pd
input_df = pd.DataFrame(data={'1': [1,2,3,4,5]
,'2': [1,2,3,4,5]
,'3': [1,2,3,4,5]
,'4': [1,2,3,4,5]
,'5': [1,2,3,4,5]})
Using pd.DataFrame, you can create a new dataframe that collects your two selected rows:
li = []
li.append(input_df.iloc[0])
li.append(input_df.iloc[4])
new_df = pd.DataFrame(li)
If what you want is for those two rows to end up in a single column, I would not pass them as a list back to the DataFrame constructor.
Instead, you can just append one row to the other, disregarding the column names:
new_df = input_df.iloc[0].append(input_df.iloc[4])
(Note that Series.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat([input_df.iloc[0], input_df.iloc[4]]) is the modern equivalent.)
Let me know if this answers your question.
The answer is already mentioned, but I would like to share my version:
li_df = pd.DataFrame(li).T
If you want each Series to be a row of the dataframe, you should not use concat() followed by .T unless all your values are of the same datatype.
If your data has both numerical and string values, then the transpose will mangle the dtypes, likely turning them all into object.
The right way to do this in general is:
Convert each Series to a dict()
Pass the list of dicts into the pd.DataFrame() constructor directly (or use pd.DataFrame.from_records()).
In your case the following should work:
my_list_of_dicts = [s.to_dict() for s in li]
my_df = pd.DataFrame(my_list_of_dicts)

How to read a column of 'dictionary' into pandas

I have a csv file with a column whose type is a dictionary (column 'b' in the example below). But b in df comes back as string type even though I define it as the dictionary type. I couldn't find a solution to this. Any suggestions?
a = pd.DataFrame({'a': [1,2,3], 'b':[{'a':3, 'haha':4}, {'c':3}, {'d':4}]})
a.to_csv('tmp.csv', index=None)
df = pd.read_csv('tmp.csv', dtype={'b':dict})
I wonder if your CSV column is actually meant to be a Python dict column, or rather JSON. If it's JSON, you can read the column as dtype=str, then use json_normalize() on that column to explode it into multiple columns. This is an efficient solution assuming the column contains valid JSON.
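A hedged sketch of that route; note that the example above writes the dicts with Python repr (single quotes), so this only helps when the file really contains JSON:
import json
import pandas as pd

df = pd.read_csv('tmp.csv', dtype={'b': str})
expanded = pd.json_normalize(df['b'].map(json.loads).tolist())  # one column per key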
There is no dictionary type in pandas, so you should specify object if you want normal Python objects:
df = pd.read_csv('tmp.csv', dtype={'b': object})
This will contain strings, because pandas doesn't know what dictionaries are. If you want dictionaries again, you could try to "eval" them with ast.literal_eval (safe string evaluation):
import ast
df['b'] = df['b'].apply(ast.literal_eval)
print(df['b'][0]['a'])  # 3
If you're really confident that you never, ever run this on untrusted csvs, you could also use eval. But before you consider that, I would recommend trying to use a DataFrame only with "native" pandas or NumPy types (or maybe a DataFrame-in-DataFrame approach). Best to avoid object dtypes as much as possible.
You could use the converters parameter.
From the documentation:
converters : dict, optional
Dict of functions for converting values in certain columns. Keys can
either be integers or column labels.
If you know your column is well formed and includes no missing values, then you could do:
import ast
df = pd.read_csv('tmp.csv', converters={'b': ast.literal_eval})
However, for safety (as others have commented), you should probably define your own function with some basic error resilience:
import ast

def to_dict(x):
    try:
        y = ast.literal_eval(x)
        if type(y) == dict:
            return y
    except (ValueError, SyntaxError):
        return None
and then:
df = pd.read_csv('tmp.csv', converters = {'b': to_dict})

String problem / Select all values > 8000 in pandas dataframe

I want to select all values bigger than 8000 within a pandas dataframe.
new_df = df.loc[df['GM'] > 8000]
However, it is not working. I think the problem is that the values come from an Excel file and the numbers are interpreted as strings, e.g. "1.111,52". Do you know how I can convert such a string to float/int in order to compare it properly?
Taken from the documentation of pd.read_excel:
Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format.
This means that pandas checks how the value is stored in Excel. If it was numeric in Excel, the conversion should work correctly. If your column was stored as text, try to use:
df = pd.read_excel('filename.xlsx', thousands='.')
If you have a csv file instead, you can solve this by specifying the thousands and decimal characters:
df = pd.read_csv('filename.csv', thousands='.', decimal=',')
You can look at df.dtypes to see the type of each column. Then, if a column's type is not what you want, you can change it with df['GM'].astype(float), and then new_df = df.loc[df['GM'].astype(float) > 8000] should work as you want.
You can convert the entire column's data type to numeric:
import pandas as pd
df['GM'] = pd.to_numeric(df['GM'])
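If the cells really look like "1.111,52", pd.to_numeric alone will fail on the separators; a hedged sketch that strips them first (assuming European formatting throughout the column):
df['GM'] = pd.to_numeric(
    df['GM'].str.replace('.', '', regex=False)   # drop the thousands separator
            .str.replace(',', '.', regex=False)  # decimal comma -> decimal point
)
new_df = df.loc[df['GM'] > 8000]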
You can see the data type of your column via its dtype attribute. In order to convert it to float, use the astype function as follows:
df['GM'].astype(float)
