TypeError: can't convert type 'NoneType' to numerator/denominator - python

Here I try to calculate a mean value based on the data in two lists of dicts. Although I have used the same code before, I keep getting an error. Is there any solution?
import pandas as pd
data = pd.read_csv('data3.csv',sep=';') # Reading data from csv
data = data.dropna(axis=0) # Drop rows with null values
data = data.T.to_dict().values() # Converting dataframe into list of dictionaries
newdata = pd.read_csv('newdata.csv',sep=';') # Reading data from csv
newdata = newdata.T.to_dict().values() # Converting dataframe into list of dictionaries
score = []
for item in newdata:
    score.append({item['Genre_Name']:item['Ranking']})
from statistics import mean
score={k:int(v) for i in score for k,v in i.items()}
for item in data:
    y = mean(map(score.get,map(str.strip,item['Recommended_Genres'].split(','))))
    print(y)
To see the csv files: https://repl.it/#rmakakgn/SVE2

The .get method of a dict returns None if the given key does not exist, and statistics.mean fails because of that. Consider:
import statistics
d = {"a":1,"c":3}
data = [d.get(x) for x in ("a","b","c")]
print(statistics.mean(data))
results in:
TypeError: can't convert type 'NoneType' to numerator/denominator
You need to remove the Nones before feeding the data into statistics.mean, which you can do with a list comprehension:
import statistics
d = {"a":1,"c":3}
data = [d.get(x) for x in ("a","b","c")]
data = [i for i in data if i is not None]
print(statistics.mean(data))
or filter:
import statistics
d = {"a":1,"c":3}
data = [d.get(x) for x in ("a","b","c")]
data = filter(lambda x:x is not None,data)
print(statistics.mean(data))
(both snippets above will print 2)
In this particular case, you can get the filtering effect by replacing:
mean(map(score.get,map(str.strip,item['Recommended_Genres'].split(','))))
with:
mean([i for i in map(score.get,map(str.strip,item['Recommended_Genres'].split(','))) if i is not None])
though, as with most Python built-in and standard-library functions that accept an iterable as their sole argument, you might decide not to build a list but to feed a generator expression directly, i.e.
mean(i for i in map(score.get,map(str.strip,item['Recommended_Genres'].split(','))) if i is not None)
For further discussion see PEP 202 and PEP 289.
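One caveat (my addition, not part of the answer above): statistics.mean raises StatisticsError when it receives an empty iterable, which here happens if none of a row's genres are present in score. A minimal sketch of a guard, reusing the question's score and item variables:

from statistics import mean, StatisticsError

def safe_mean(values, default=None):
    # mean() consumes the generator directly; missing lookups are filtered out
    try:
        return mean(v for v in values if v is not None)
    except StatisticsError:  # raised when every lookup missed
        return default

y = safe_mean(map(score.get, map(str.strip, item['Recommended_Genres'].split(','))))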

Related

How can I handle missing values in the dictionary when I use the function eval(String dictionary) -> dictionary PYTHON?

I need to convert the 'content' column from a string dictionary to a dictionary in Python. After that I will use the following line of code:
df['content'].apply(pd.Series)
to have the dictionary keys as column names and the dictionary values in the cells.
I can’t do this now because there are missing values in the dictionary string.
How can I handle missing values in the dictionary when I use the function eval(String dictionary) -> dictionary?
I'm working on the 'content' column, which I want to convert to the correct format first. I tried the eval() function, but it doesn't work because there are missing values. This is JSON data. My goal is to have the content column's keys as the column titles and its values in the cells. (Screenshot: https://i.stack.imgur.com/1CsIl.png)
You can use json.loads in a lambda function: if the row value is NaN, leave it; if not, apply json.loads:
import json
import numpy as np
df['content']=df['content'].apply(lambda x: json.loads(x) if pd.notna(x) else np.nan)
Now you can use pd.Series:
v1 = df['content'].apply(pd.Series)
df = df.drop(['content'], axis=1).join(v1)
If you have missing values inside the string dictionaries themselves:
def check_json(x):
    import ast
    import json
    if pd.isna(x):
        return np.nan
    else:
        try:
            return json.loads(x)
        except:
            try:
                mask = x.replace('{','').replace('}','')  # strip braces from the dictionary string
                mask = mask.split(",")
                for i in range(0, len(mask)):
                    if not len(mask[i].partition(":")[-1]) > 0:  # nothing after the colon: value is missing
                        print(mask[i])
                        mask[i] = mask[i] + '"None"'  # ---> you can replace None with whatever you want
                return json.loads(str({','.join(mask)}).replace("\'", ""))
            except:
                try:
                    x = x.replace("\'", "\"")
                    mask = x.replace('{','').replace('}','')  # strip braces from the dictionary string
                    mask = mask.split(",")
                    for i in range(0, len(mask)):
                        if not len(mask[i].partition(":")[-1]) > 0:
                            print(mask[i])
                            mask[i] = mask[i] + '"None"'  # ---> you can replace None with whatever you want
                    b = str({','.join(mask)}).replace("\'", "")
                    return ast.literal_eval(b)
                except:
                    print("Could not parse json object. Returning nan")
                    return np.nan
df['content'] = df['content'].apply(lambda x: check_json(x))
v1 = df['content'].apply(pd.Series)
df = df.drop(['content'], axis=1).join(v1)
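A side note (my addition, not part of the answer above): when the strings are Python dict literals rather than strict JSON (single quotes, None instead of null), ast.literal_eval parses them safely, unlike eval:

import ast
import numpy as np
import pandas as pd

s = "{'key1': 'value1', 'key2': None}"  # Python-literal style, not valid JSON
d = ast.literal_eval(s)                 # -> {'key1': 'value1', 'key2': None}

df['content'] = df['content'].apply(lambda x: ast.literal_eval(x) if pd.notna(x) else np.nan)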
I cannot see what the missing values look like in your screenshot, but I tested the following code and got what seems to be a good result. The simple explanation is to use str.replace() to fix the null values before parsing the string to a dict.
import pandas as pd
import numpy as np
import json
## setting up an example dataframe. note that row2 has a null value
json_example = [
    '{"row1_key1":"row1_value1","row1_key2":"row1_value2"}',
    '{"row2_key1":"row2_value1","row2_key2": null}'
]
df = pd.DataFrame()
df['Content'] = json_example
## using string replace on the string representation of the json to clean it up
## (note: the result must be assigned back, otherwise the replacement is lost)
df['Content'] = df['Content'].apply(lambda x: x.replace('null','"0"'))
## using lambda x to first load the string into a dict, then applying pd.Series()
df['Content'].apply(lambda x: pd.Series(json.loads(x)))
Output
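For reference, my rendering of the deterministic result of the last line, given the example data above:

     row1_key1    row1_key2    row2_key1 row2_key2
0  row1_value1  row1_value2          NaN       NaN
1          NaN          NaN  row2_value1         0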

How to extract the data from a list of dictionaries?

I'm collecting some market data from Binance's API. My goal is to collect the list of all markets and use the 'status' key included in each row to detect if the market is active or not. If it's not active, I must search the last trade to collect the date of the market's shutdown.
I wrote this code
import requests
import pandas as pd
import json
import csv
url = 'https://api.binance.com/api/v3/exchangeInfo'
trade_url = 'https://api.binance.com/api/v3/trades?symbol='
response = requests.get(url)
data = response.json()
df = data['symbols'] #list of dict
json_data=[]
with open(r'C:\Users\Utilisateur\Desktop\json.csv', 'a', encoding='utf-8', newline='') as j:
    wr = csv.writer(j)
    wr.writerow(["symbol","last_trade"])
    for i in data['symbols']:
        if data[i]['status'] != "TRADING":
            trades_req = requests.get(trade_url + i)
            print(trades_req)
but I got this error:
TypeError: unhashable type: 'dict'
How can I avoid it?
That's because i is a dictionary. Since data['symbols'] is a list of dictionaries, when you do this in the loop:
for i in data['symbols']:
    if data[i]['status'] ...
you are trying to hash i to use it as a key of data. I think you want to check the status of each dictionary in the list. That is:
for i in data['symbols']:
    if i['status'] ...
In such a case, it would be better to use more descriptive variable names, e.g., d, s, or symbol instead of i.
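A minimal corrected sketch of the question's loop (my assembly, assuming, as the Binance exchangeInfo response suggests, that each entry in data['symbols'] is a dict with 'symbol' and 'status' keys):

import csv
import requests

url = 'https://api.binance.com/api/v3/exchangeInfo'
trade_url = 'https://api.binance.com/api/v3/trades?symbol='
data = requests.get(url).json()

with open('json.csv', 'a', encoding='utf-8', newline='') as j:
    wr = csv.writer(j)
    wr.writerow(["symbol", "last_trade"])
    for symbol in data['symbols']:  # each symbol is a dict, not a key into data
        if symbol['status'] != "TRADING":
            # the trades endpoint expects the symbol's name, not the dict itself
            trades_req = requests.get(trade_url + symbol['symbol'])
            print(trades_req)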

Datetime storing in hd5 database

I have a list of np.datetime64 data that looks as follows:
times =[2015-03-26T16:02:42.000000Z,
2015-03-26T16:02:45.000000Z,...]
type(times) returns list
type(times[1]) returns obspy.core.utcdatetime.UTCDateTime
Now, I understand that h5py does not support date time data.
I have tried the following:
time_str = [n.encode("ascii", "ignore") for n in time_str]
time_str = [str(s) for s in time_str]
type(time_str[1]) returns bytes
I am okay with creating the dataset and storing these datetime values as strings.
However, when attempting to create the dataset, I get the following error:
with h5py.File('data_ML.hdf5', 'w') as f:
    f.create_dataset("time", data=time_str, maxshape=(None), chunks=True, dtype='str')
TypeError: No conversion path for dtype: dtype('<U')
Where am I messing up? Is there an alternative way to store these values as-is so I can extract them later?
Ok, here we go. I couldn't get some of your code to work together (maybe you left some steps out, or changed variable names?). And I could not get the obspy.core.utcdatetime.UTCDateTime object you have.
So I created an example that does the following:
1. Starts with a list of np.datetime64() objects
2. Converts to a list of np.datetime_as_string() objects in UTC format (see note below)
3. Converts to a np.array with dtype='S30'
Note: I included Step 2 to replicate your data. See the following section for a simpler version.
Code below:
import numpy as np
import h5py

times = [np.datetime64('2015-03-26T16:02:42.000000'),
         np.datetime64('2015-03-26T16:02:45.000000'),
         np.datetime64('2015-03-26T16:02:48.000000'),
         np.datetime64('2015-03-26T16:02:55.000000')]
utc_times = [np.datetime_as_string(n, timezone='UTC') for n in times]
utc_str_arr = np.array(utc_times, dtype='S30')

with h5py.File('data_ML.hdf5', 'w') as f:
    # note: maxshape must be the tuple (None,) -- a bare (None) is just None
    f.create_dataset("time", data=utc_str_arr, maxshape=(None,), chunks=True)
You can simplify the process if you are starting with np.datetime64() objects, and don't have (and don't need or want) the intermediate list of string objects (variable utc_times in my code). The method below skips Step 2 above, and shows 2 ways to create a np.array() of properly encoded strings.
Code below:
times = [np.datetime64('2015-03-26T16:02:42.000000'),
         np.datetime64('2015-03-26T16:02:45.000000'),
         np.datetime64('2015-03-26T16:02:48.000000'),
         np.datetime64('2015-03-26T16:02:55.000000')]

# Create empty array with defined size and 'S#' dtype, then populate with a for loop:
utc_str_arr1 = np.empty((len(times),), dtype='S30')
for i, n in enumerate(times):
    utc_str_arr1[i] = np.datetime_as_string(n, timezone='UTC')

# -OR- Create and populate the array with a list comprehension:
utc_str_arr2 = np.array([np.datetime_as_string(n, timezone='UTC').encode('utf-8') for n in times])

with h5py.File('data_ML.hdf5', 'w') as f:
    f.create_dataset("time1", data=utc_str_arr1, maxshape=(None,), chunks=True)
    f.create_dataset("time2", data=utc_str_arr2, maxshape=(None,), chunks=True)
The final result looks similar with either method (the second code block creates 2 identical datasets).
Image from HDFView:
To Read the Data:
Per the request in an Aug-02-2021 comment, here is the code to extract the data from HDF5 and create Pandas timestamp objects (then save them to a dataframe). First, the byte strings in the dataset are read and converted to NumPy Unicode strings with .astype(). Then the strings are converted to Pandas timestamp objects with pd.to_datetime() using the format= parameter.
import h5py
import numpy as np
import pandas as pd
with h5py.File('data_ML.hdf5', 'r') as h5f:
    ## returns a h5py dataset object:
    dts_ds = h5f["time"]
    longest_word = len(max(dts_ds, key=len))
    ## returns an array of byte strings representing np.datetime64;
    ## .astype() is used to convert the byte strings to unicode
    dts_arr = dts_ds[:].astype('U' + str(longest_word))

## create a new array to hold Pandas datetime objects,
## then loop over the first array to convert and populate the new array
pd_dts_arr = np.empty((dts_arr.shape[0],), dtype=object)
for i, dts in enumerate(dts_arr):
    pd_dts_arr[i] = pd.to_datetime(dts, format='%Y-%m-%dT%H:%M:%S.%fZ')
dts_df = pd.DataFrame(pd_dts_arr)
There are a lot of ways to represent dates and time using native Python, NumPy and Pandas objects. More details about working with them can be found at this answer:
Converting between datetime, Timestamp and datetime64
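Another option (my addition, not from the answer above): skip strings entirely and store the datetimes as int64 epoch values, which HDF5 handles natively, reconstructing datetime64 on read:

import numpy as np
import h5py

times = [np.datetime64('2015-03-26T16:02:42.000000'),
         np.datetime64('2015-03-26T16:02:45.000000')]

# datetime64[us] -> int64 microseconds since the epoch
epoch_us = np.array(times, dtype='datetime64[us]').astype(np.int64)

with h5py.File('data_ML.hdf5', 'w') as f:
    f.create_dataset("time_us", data=epoch_us)

with h5py.File('data_ML.hdf5', 'r') as f:
    restored = f["time_us"][:].astype('datetime64[us]')  # back to datetime64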

Convert list of strings to list of json objects in pyspark

def tranform_doc(docs):
    json_list = []
    print(docs)
    for doc in docs:
        json_doc = {}
        json_doc["customKey"] = doc
        json_list.append(json_doc)
    return json_list
df.groupBy("colA") \
.agg(custom_udf(collect_list(col("colB"))).alias("customCol"))
First Hurdle:
Input: ["str1","str2","str3"]
Output: [{"customKey":"str1"},{"customKey":"str2"},{"customKey":"str3"}]
Second Hurdle:
The columns in the agg collect_list change dynamically, so how can I adjust the schema dynamically? When the number of elements in the list changes, I receive this error:
Input row doesn't have expected number of values required by the schema. 1 fields are required while 3 values are provided
What I did:
def tranform_doc(agg_docs):
    return json_list
## When I failed to get a list of JSON, I tried just returning the original list of strings
schema = StructType([StructField("col1",StringType()), StructField("col2",StringType()), StructField("col3",StringType())])
custom_udf = udf(tranform_doc,schema)
df.groupBy("colA") \
.agg(custom_udf(collect_list(col("colB"))).alias("customCol"))
Output I got:
{"col2":"str1","col1":"str2","col3":"str3"}
I am struggling to get the required list of JSON strings and to make it dynamic with respect to the number of elements in the list.
No UDF needed. You can convert colB to a struct before collect_list.
import pyspark.sql.functions as F
df2 = df.groupBy('colA').agg(
    F.to_json(
        F.collect_list(
            F.struct(F.col('colB').alias('customKey'))
        )
    ).alias('output')
)
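For illustration, a minimal sketch of the expected behavior (my toy example, assuming spark is an existing SparkSession and using the question's sample input):

import pyspark.sql.functions as F

df = spark.createDataFrame(
    [('a', 'str1'), ('a', 'str2'), ('a', 'str3')],
    ['colA', 'colB']
)
df2 = df.groupBy('colA').agg(
    F.to_json(
        F.collect_list(F.struct(F.col('colB').alias('customKey')))
    ).alias('output')
)
df2.show(truncate=False)
# +----+-----------------------------------------------------------------+
# |colA|output                                                           |
# +----+-----------------------------------------------------------------+
# |a   |[{"customKey":"str1"},{"customKey":"str2"},{"customKey":"str3"}] |
# +----+-----------------------------------------------------------------+

No fixed schema is involved, so the output adapts to however many elements each group collects.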

Return unknown strings in a dataframe (extract unknown strings from a large dataset)

I have a large dataset, imported with read_csv as shown below, which should contain only float measurements and NaNs.
df = pd.read_csv(file_,parse_dates=[['Date','Time']],na_values = ['No Data','Bad Data','','No Sample'],low_memory=False)
When I apply df.dtypes, most of the columns come back as object type, which indicates that there are other objects in the dataframe that I am not aware of. I am looking for a way of identifying those strings and replacing them with na values.
The first thing I wanted to do was convert everything to dtype=np.float, but I couldn't. Then I tried to read each (column, index) cell and return the identified string.
I tried something very inefficient (I am a beginner) and time consuming; it has worked for other dataframes, but here it returns an error:
TypeError: argument of type 'float' is not iterable
from isstring import *
list_string = []
for i in range(0, len(df)):
    for j in range(0, len(df.columns)):
        x = df.iloc[i, j]
        if isstring(x) and '.' not in x:
            list_string.append(x)
list_string = pd.DataFrame(list_string, columns=["list_string"])
g = list_string.groupby('list_string').size()
Is there a simple way of detecting unknown strings in a large dataset? Thanks.
You could try:
string_list = []
for col, series in df.items():  # iterating over all columns - perhaps only select `object` types
    string_list += [s for s in series.unique() if isinstance(s, str)]
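An alternative sketch (my addition, not part of the answer above): coerce each column with pd.to_numeric(errors='coerce'), which turns anything non-numeric into NaN, and compare against the original to list the offending strings:

import pandas as pd

# hypothetical toy frame standing in for the question's data
df = pd.DataFrame({'a': [1.0, 'Err 99', 2.5], 'b': [3.0, 4.0, 'n/a']})

for col in df.columns:
    coerced = pd.to_numeric(df[col], errors='coerce')   # non-numeric -> NaN
    bad = df[col][coerced.isna() & df[col].notna()]     # entries that failed to convert
    if not bad.empty:
        print(col, bad.unique())
    df[col] = coerced  # replace the strings with NaN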
