I have a number of pandas DataFrames with differing formats that should be reshaped into a common target format.
Right now, I write dictionaries for each DF:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({"original_name":["a","b","c"],"original_value":[1,2,3]})
key_dict = {
    "name": df1.original_name,
    "value": df1.original_value,
    "other_value": np.nan
}
target_colnames = ["name","value","other_value"]
new_df = pd.DataFrame(key_dict, columns = target_colnames)
My problem: the mapping of original to target columns (key_dict) is stored in a CSV file (index = values, columns = one key per DF).
key_df= pd.read_csv("key_df.csv").set_index("key")
key_df= key_df.to_dict()
new_df = pd.DataFrame(key_df["df1"], columns = target_colnames)
This leads to the following error:
"If using all scalar values, you must pass an index"
I think it's because the values of 'key_df' are strings unlike in 'key_dict'. Do I need to apply 'eval' on the keys?
This is how key_df["df1"] looks:
{'name': 'df1.original_name',
'other_value': 'np.nan',
'value': 'df1.original_value'}
Use:
key_dict = {i: eval(j) for i, j in key_df["df1"].items()}  # use iteritems() for Python 2
new_df = pd.DataFrame(key_dict, columns=target_colnames)
Output
name value other_value
a 1 NaN
b 2 NaN
c 3 NaN
Explanation
After loading the CSV and converting it to a dict, you have to use a dict comprehension to eval() the pd.Series values that were stored as strings, so you can reuse the same new_df code to get what you want.
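If you would rather not eval() strings that come from a file, one alternative is to resolve the mapping strings yourself. This is only a sketch that assumes every mapping value looks like "frame_name.column_name" (anything else, such as the literal "np.nan", is treated as missing); the frames registry and the resolve helper are hypothetical names, not part of the original code:
frames = {"df1": df1}  # hypothetical registry of the source DataFrames

def resolve(expr):
    # split e.g. "df1.original_name" into the frame name and the column name
    frame_name, _, col = expr.partition(".")
    if frame_name in frames and col in frames[frame_name].columns:
        return frames[frame_name][col]
    return np.nan  # anything unresolvable (e.g. the string "np.nan") becomes NaN

key_dict = {target: resolve(expr) for target, expr in key_df["df1"].items()}
new_df = pd.DataFrame(key_dict, columns=target_colnames)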
Related
I have a pandas column called geo inside a DataFrame called df. It can be populated or not.
geo has an object, referencing an id of a place.
I'm accessing row 18 using df.iloc[18].geo and it returns:
"{'place_id': '1c37515518593fe3'}"
It's a str type.
How can I create a new column inside df called place_id containing the value (in my example: 1c37515518593fe3)?
This post should help you convert string literals into dictionaries:
Convert a String representation of a Dictionary to a dictionary?
EDIT: updated for possible null values
import pandas as pd
import numpy as np
import ast
df = pd.DataFrame(data=["{'place_id': '1c37515518593fe3'}", np.NaN], columns=["geo"])
df["geo"] = df["geo"].apply(lambda x: ast.literal_eval(x) if not pd.isnull(x) else x)
df['place_id'] = df['geo'].apply(lambda x: x.get('place_id', np.NaN) if not pd.isnull(x) else x)
print(df)
geo place_id
0 {'place_id': '1c37515518593fe3'} 1c37515518593fe3
1 NaN NaN
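A slightly shorter variant of the second step: Series.str.get extracts a key element-wise from dicts and passes NaN through, so (a small sketch, assuming geo already holds dicts or NaN as above):
df['place_id'] = df['geo'].str.get('place_id')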
I have saved out a data column as follows:
[[A,1], [B,5], [C,18]....]
I was hoping to group A, B, C as shown above into Category and 1, 5, 18 into Values/Series for updating my PowerPoint chart using python-pptx.
Example:
Category  Values
A         1
B         5
Is there any way I can do it? Currently the above example is also extracted as a string, so I believe I have to convert it to a list first?
Thanks in advance!
Try to parse your strings (a list of lists) then create your dataframe from the real list:
import pandas as pd
import re
s = '[[A,1], [B,5], [C,18]]'
cols = ['Category', 'Values']
data = [row.split(',') for row in re.findall(r'\[([^]]+)\]', s[1:-1])]
df = pd.DataFrame(data, columns=cols)
print(df)
# Output:
Category Values
0 A 1
1 B 5
2 C 18
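Note that the parsed values are still strings at this point; if you need actual numbers, a quick follow-up (a sketch) is:
df['Values'] = pd.to_numeric(df['Values'])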
You should be able to just use pandas.DataFrame and pass in your data, unless I'm misunderstanding the question. Anyway, try:
df = pandas.DataFrame(data=d, columns = ['Category', 'Value'])
where d is your list of tuples.
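For the data shown in the question, that would look roughly like this (a sketch that assumes the string has already been parsed into a list of lists, e.g. with the approach above):
import pandas

d = [["A", 1], ["B", 5], ["C", 18]]  # already-parsed data
df = pandas.DataFrame(data=d, columns=['Category', 'Value'])
print(df)
#   Category  Value
# 0        A      1
# 1        B      5
# 2        C     18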
from prettytable import PrettyTable

column = [["A", 1], ["B", 5], ["C", 18]]
columnname = []
columnvalue = []
t = PrettyTable(['Category', 'Values'])
for data in column:
    columnname.append(data[0])
    columnvalue.append(data[1])
    t.add_row([data[0], data[1]])
print(t)
I have a CSV column called ref_type, as shown in the screenshot below, with mixed types: some rows are plain strings and others are JSON. I am reading this CSV using the pandas read_csv method, which reads the type in as object.
I would like to convert the JSON part as below.
Please help me parse the above scenario.
Thanks in advance.
Found a solution; it's not the best but it's working.
I already have a flatten-JSON function as below:
import json
import pandas as pd

def flatten_json_columns(df, json_cols, custom_df):
    """
    This function flattens JSON columns into individual columns.
    It merges the flattened dataframe with the expected dataframe to capture missing columns from the JSON.
    :param df: CSV raw dataframe
    :param json_cols: custom data columns in the CSVs
    :param custom_df: expected dataframe
    :return: df pandas dataframe
    """
    # Loop through all JSON columns
    for column in json_cols:
        if not df[column].isnull().all():
            # Replace None and NaN with empty braces
            df[column].fillna(value='{}', inplace=True)
            # Deserialize a str instance containing a JSON document into a Python object
            df[column] = df[column].apply(json.loads)
            # Normalize semi-structured JSON data into a flat table
            column_as_df = pd.json_normalize(df[column])
            # Extract the main column name and attach it to each sub-column name
            column_as_df.columns = [f"{column}_{subcolumn}" for subcolumn in column_as_df.columns]
            # Merge the extracted result from the custom_data field with the expected fields
            result_df = pd.merge(column_as_df, custom_df, how='left')
            # Merge the flattened dataframe with the original dataframe
            df = df.merge(result_df, right_index=True, left_index=True)
        else:
            df = pd.concat([df, custom_df], axis=1)
    # Return dataframe with flattened columns
    return df
My data frame looks like below.
I created another column called ref_type_json from ref_type by keeping only the JSON rows and ignoring all plain strings. Instead of strings I returned None:
ref_type_df['ref_type_json'] = [column if column[0] == '{' else None for column in ref_type_df['ref_type']]
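If some of the rows in ref_type can be NaN rather than strings, indexing column[0] will raise an error; a slightly more defensive version of that comprehension (just a sketch) is:
ref_type_df['ref_type_json'] = [
    col if isinstance(col, str) and col.startswith('{') else None
    for col in ref_type_df['ref_type']
]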
Now the ref_type_df looks as below.
I also created an empty expected data frame so that the output of the flatten-JSON function aligns with the output of the expected dataframe:
ref_type_expected = {
    'ref_type_json_fromNumber': [],
    'ref_type_json_toNumber': [],
    'ref_type_json_comment': []
}
ref_type_expected_df = pd.DataFrame.from_dict(ref_type_expected)
Finally, I invoked the flatten-JSON function, which converts the JSON to columns:
result_df = flatten_json_columns(df=ref_type_df,
                                 json_cols=['ref_type_json'],
                                 custom_df=ref_type_expected_df)
result_df = result_df.drop('ref_type_json', axis=1)
My result data frame looks as below.
Please let me know if you have a better solution for it.
I would just build a dataframe containing the new columns by hand and join it to the first one. Unfortunately you have not provided copyable data so I just used mine.
Original df:
df = pd.DataFrame({'ref': ['Outcomes', 'API-TEST', '{"from":"abc", "to": "def"}',
                           'Manual(add)', '{"from": "gh", "to": "ij"}', 'Migration']})
Giving:
ref
0 Outcomes
1 API-TEST
2 {"from":"abc", "to": "def"}
3 Manual(add)
4 {"from": "gh", "to": "ij"}
5 Migration
Extract only the JSON data from the ref column:
import json

data = []     # future data of the dataframe
ix = []       # future index
cols = set()  # future columns
for name, s in df[['ref']].iterrows():
    try:
        d = json.loads(s['ref'])
        ix.append(name)  # if we could decode, feed the future dataframe
        cols.update(set(d.keys()))
        data.append(d)
    except json.JSONDecodeError:
        pass  # else ignore the line
df = df.join(pd.DataFrame(data, ix, cols), how='left')
gives:
ref to from
0 Outcomes NaN NaN
1 API-TEST NaN NaN
2 {"from":"abc", "to": "def"} def abc
3 Manual(add) NaN NaN
4 {"from": "gh", "to": "ij"} ij gh
5 Migration NaN NaN
Let's say df is a pandas DataFrame.
I would like to find all columns of numeric type.
Something like:
isNumeric = is_numeric(df)
You could use select_dtypes method of DataFrame. It includes two parameters include and exclude. So isNumeric would look like:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)
Simple one-line answer to create a new dataframe with only numeric columns:
df.select_dtypes(include=np.number)
If you want the names of numeric columns:
df.select_dtypes(include=np.number).columns.tolist()
Complete code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': range(7, 10),
'B': np.random.rand(3),
'C': ['foo','bar','baz'],
'D': ['who','what','when']})
df
# A B C D
# 0 7 0.704021 foo who
# 1 8 0.264025 bar what
# 2 9 0.230671 baz when
df_numerics_only = df.select_dtypes(include=np.number)
df_numerics_only
# A B
# 0 7 0.704021
# 1 8 0.264025
# 2 9 0.230671
colnames_numerics_only = df.select_dtypes(include=np.number).columns.tolist()
colnames_numerics_only
# ['A', 'B']
You can use the undocumented function _get_numeric_data() to filter only numeric columns:
df._get_numeric_data()
Example:
In [32]: data
Out[32]:
A B
0 1 s
1 2 s
2 3 s
3 4 s
In [33]: data._get_numeric_data()
Out[33]:
A
0 1
1 2
2 3
3 4
Note that this is a "private method" (i.e., an implementation detail) and is subject to change or total removal in the future. Use with caution.
df.select_dtypes(exclude=['object'])
Update:
df.select_dtypes(include=np.number)
or, with newer versions of pandas:
df.select_dtypes('number')
Simple one-liner:
df.select_dtypes('number').columns
The following code will return a list of the names of the numeric columns of a data set.
cnames = list(marketing_train.select_dtypes(exclude=['object']).columns)
Here marketing_train is my data set, select_dtypes() is the function for selecting data types using the exclude and include arguments, and columns is used to fetch the column names of the data set.
The output of the above code will be the following:
['custAge',
'campaign',
'pdays',
'previous',
'emp.var.rate',
'cons.price.idx',
'cons.conf.idx',
'euribor3m',
'nr.employed',
'pmonths',
'pastEmail']
This is another simple way of finding the numeric columns in a pandas data frame:
numeric_clmns = df.dtypes[df.dtypes != "object"].index
We can include and exclude data types as per the requirement as below:
train.select_dtypes(include=None, exclude=None)
train.select_dtypes(include='number') #will include all the numeric types
Referred from Jupyter Notebook.
To select all numeric types, use np.number or 'number'.
To select strings you must use the object dtype, but note that this will return all object dtype columns.
See the NumPy dtype hierarchy: http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html
To select datetimes, use np.datetime64, 'datetime' or 'datetime64'.
To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'.
To select Pandas categorical dtypes, use 'category'.
To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'.
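For example, combining these selectors on a small frame (an illustrative sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2],
                   'b': [1.5, 2.5],
                   'c': ['x', 'y'],
                   'd': pd.to_datetime(['2020-01-01', '2020-01-02'])})
df.select_dtypes(include='number')      # columns a and b
df.select_dtypes(include='datetime')    # column d
df.select_dtypes(exclude=[np.number])   # columns c and d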
Although this is an old subject, I think the following formula is easier than all the other answers:
df[df.describe().columns]
As the function describe() only works on numeric columns by default, the columns of the output will only be the numeric ones.
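A quick illustration of that (a sketch):
df = pd.DataFrame({'A': [1, 2], 'B': [0.5, 0.25], 'C': ['foo', 'bar']})
df[df.describe().columns]   # keeps only the numeric columns A and B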
Please see the below code:
if dataset.select_dtypes(include=[np.number]).shape[1] > 0:
    display(dataset.select_dtypes(include=[np.number]).describe())

if dataset.select_dtypes(include=['object']).shape[1] > 0:
    display(dataset.select_dtypes(include=['object']).describe())
This way you can check whether the values are numeric, such as float and int, or string values. The second if statement checks for the string values, which fall under the object dtype.
Adapting this answer, you could do
df.loc[:, df.applymap(np.isreal).all(axis=0)]
Here, df.applymap(np.isreal) shows whether every cell in the data frame is numeric, and .all(axis=0) checks whether all values in a column are True, returning a Series of Booleans that can be used to select the desired columns.
A lot of the posted answers are inefficient. These answers either return/select a subset of the original dataframe (a needless copy) or perform needless computational statistics in the case of describe().
To just get the column names that are numeric, one can use a conditional list comprehension with the pd.api.types.is_numeric_dtype function:
numeric_cols = [col for col in df if pd.api.types.is_numeric_dtype(df[col])]
I'm not sure when this function was introduced.
import numpy as np
import pandas as pd

def is_type(df, baseType):
    test = [issubclass(np.dtype(d).type, baseType) for d in df.dtypes]
    return pd.DataFrame(data=test, index=df.columns, columns=["test"])

def is_float(df):
    return is_type(df, np.floating)

def is_number(df):
    return is_type(df, np.number)

def is_integer(df):
    return is_type(df, np.integer)
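Usage would look something like this (a sketch):
df = pd.DataFrame({'A': [1, 2], 'B': [0.1, 0.2], 'C': ['x', 'y']})
print(is_number(df))
#     test
# A   True
# B   True
# C  False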
I have a pandas dataframe which I have created from data stored in an xml file:
Initially the XML file is opened and parsed:
xmlData = etree.parse(filename)
trendData = xmlData.findall("//TrendData")
I created a dictionary which lists all the data names (which are used as column names) as keys and gives the position of the data in the XML file:
Parameters = {"TreatmentUnit":("Worklist/AdminData/AdminValues/TreatmentUnit"),
"Modality":("Worklist/AdminData/AdminValues/Modality"),
"Energy":("Worklist/AdminData/AdminValues/Energy"),
"FieldSize":("Worklist/AdminData/AdminValues/Fieldsize"),
"SDD":("Worklist/AdminData/AdminValues/SDD"),
"Gantry":("Worklist/AdminData/AdminValues/Gantry"),
"Wedge":("Worklist/AdminData/AdminValues/Wedge"),
"MU":("Worklist/AdminData/AdminValues/MU"),
"My":("Worklist/AdminData/AdminValues/My"),
"AnalyzeParametersCAXMin":("Worklist/AdminData/AnalyzeParams/CAX/Min"),
"AnalyzeParametersCAXMax":("Worklist/AdminData/AnalyzeParams/CAX/Max"),
"AnalyzeParametersCAXTarget":("Worklist/AdminData/AnalyzeParams/CAX/Target"),
"AnalyzeParametersCAXNorm":("Worklist/AdminData/AnalyzeParams/CAX/Norm"),
....}
This is just a small part of the dictionary; the actual one lists over 80 parameters.
The dictionary keys are then sorted:
sortedKeys = list(sorted(Parameters.keys()))
A header is created for the pandas dataframe:
dateList=[]
dateList.append('date')
headers = dateList+sortedKeys
I then create an empty pandas dataframe with the same number of rows as there are records in trendData, with the column headers set to headers, and then loop through the file, filling the dataframe:
df = pd.DataFrame(index=np.arange(0, len(trendData)), columns=headers)
for a, b in enumerate(trendData):
    result = {}
    result["date"] = dateutil.parser.parse(b.attrib['date'])
    for i, j in enumerate(Parameters):
        result[j] = b.findtext(Parameters[j])
    df.loc[a] = result
df = df.set_index('date')
This seems to work fine, but the problem is that the dtype for each column is set to 'object', whereas most should be integers. It's possible to use:
df.convert_objects(convert_numeric=True)
and it works fine, but it is now deprecated.
I can also use, for example:
df.AnalyzeParametersBQFMax = pd.to_numeric(df.AnalyzeParametersBQFMax)
to convert individual columns. But is there a way of using pd.to_numeric with a list of column names? I can create a list of the columns which should be integers using the following:
int64list = []
for q in sortedKeys:
    if q.startswith("AnalyzeParameters"):
        int64list.append(q)
but I can't find a way of passing this list to the function.
You can explicitly replace columns in a DataFrame with the same column just with another dtype.
Try this:
import pandas as pd
data = pd.DataFrame({'date':[2000, 2001, 2002, 2003], 'type':['A', 'B', 'A', 'C']})
data['date'] = data['date'].astype('int64')
When now calling data.dtypes, it should return the following:
date int64
type object
dtype: object
For multiple columns, use a for loop to run through the int64list you mentioned in your question, for example as sketched below.
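A minimal sketch of that loop (reusing the int64list built in the question):
for col in int64list:
    df[col] = pd.to_numeric(df[col])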
For multiple columns you can do it this way:
cols = df.filter(like='AnalyzeParameters').columns.tolist()
df[cols] = df[cols].astype(np.int64)
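If you specifically want pd.to_numeric, as asked in the question, it can also be applied over that same list of columns (a sketch):
df[cols] = df[cols].apply(pd.to_numeric)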