I am trying to remove keys with nan values from a dictionary built from pandas using Python. Is there a way I can achieve this?
Here is a sample of my dictionary:
{'id': 1, 'internal_id': '1904', 'first_scraping_time': '2020-04-17 12:44:59.0', 'first_scraping_date': '2020-04-17', 'last_scraping_time': '2020-06-20 03:08:47.0', 'last_scraping_date': '2020-06-20', 'is_active': 1, 'flags': nan, 'phone': nan, 'size': 60.0, 'available': '20-06-2020', 'timeframe': nan, 'teaser': nan, 'remarks': nan, 'rent': 4984.0, 'rooms': '3', 'downpayment': nan, 'deposit': '14952', 'expenses': 600.0, 'expenses_tv': nan, 'expenses_improvements': nan, 'expenses_misc': nan, 'prepaid_rent': '4984', 'pets': nan, 'furnished': nan, 'residence_duty': nan, 'precision': nan, 'nearby_cities': nan, 'type_dwelling': nan, 'type_tenants': nan, 'task_id': '614b8fc2-409c-403a-9650-05939e8a89c7'}
Thank you!
nan is a tricky object to work with because it doesn't compare equal to anything, including itself (and distinct nan objects don't necessarily share identity).
You can use math.isnan to test for it:
import math
# guard with isinstance: math.isnan raises TypeError for non-float values
new = {key: value for (key, value) in old.items()
       if not (isinstance(value, float) and math.isnan(value))}
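Since the dictionary came out of pandas anyway, another option is pandas.isna, which handles NaN, None, strings and numbers alike. A minimal sketch, assuming the dictionary is named old as above:
import pandas as pd

# pd.isna returns True for NaN/None and False for ordinary strings and numbers,
# so no type guard is needed
new = {key: value for (key, value) in old.items() if not pd.isna(value)}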
I'm struggling with the following case. I have a dataframe with columns containing NaN or a string value. How could I merge all 3 columns (i.e. Q3_4_4, Q3_4_5, and Q3_4_6) into one new column (i.e. Q3_4) by only keeping the string value? This column would have 'hi' in row 1, 'bye' in row 2, and 'hello' in row 3.
Thank you for your help
{'Q3_4_4': {'R_00RfS8OP6QrTNtL': nan,
'R_3JtmbtdPjxXZAwA': nan,
'R_3G2sp6TEXZmf2KI': 'hello'
},
'Q3_4_5': {'R_00RfS8OP6QrTNtL': 'hi',
'R_3JtmbtdPjxXZAwA': nan,
'R_3G2sp6TEXZmf2KI': nan},
'Q3_4_6': {'R_00RfS8OP6QrTNtL': nan,
'R_3JtmbtdPjxXZAwA': 'bye',
'R_3G2sp6TEXZmf2KI': nan},
}
If you need to join columns whose names differ only in the piece after the last _, use GroupBy.first with axis=1 (i.e. group by columns), passing a lambda that splits each column name from the right on the last _ and keeps the prefix:
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({'Q3_4_4': {'R_00RfS8OP6QrTNtL': nan,
'R_3JtmbtdPjxXZAwA': nan,
'R_3G2sp6TEXZmf2KI': 'hello'
},
'Q3_4_5': {'R_00RfS8OP6QrTNtL': 'hi',
'R_3JtmbtdPjxXZAwA': nan,
'R_3G2sp6TEXZmf2KI': nan},
'Q3_4_6': {'R_00RfS8OP6QrTNtL': nan,
'R_3JtmbtdPjxXZAwA': 'bye',
'R_3G2sp6TEXZmf2KI': nan},
})
df = df.groupby(lambda x: x.rsplit('_', 1)[0], axis=1).first()
print(df)
Q3_4
R_00RfS8OP6QrTNtL hi
R_3JtmbtdPjxXZAwA bye
R_3G2sp6TEXZmf2KI hello
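Note that groupby(..., axis=1) is deprecated in recent pandas versions; a hedged equivalent for the df above is to transpose, group the former column labels, and transpose back:
# group the transposed frame by column-name prefix, take the first non-null
# value per group, then flip back to the original orientation
df = df.T.groupby(lambda x: x.rsplit('_', 1)[0]).first().T
print(df)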
An alternative is to take the last non-NaN value in each row of the original (ungrouped) DataFrame:
df.apply(lambda x: x[x.last_valid_index()], axis=1)
More methods: https://stackoverflow.com/a/46520070/8170215
I have the following dataset (I will upload only a sample of 4 rows, the real one has 15,000 rows):
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from sklearn.feature_selection import chi2
quotes=["Sip N Shop Come thru right now Marjais PopularNobodies MMR Marjais SipNShop",
"I do not know about you but My family and I will not take the Covid19 vaccine anytime soon",
"MSignorile Immunizations should be mandatory Period In Oklahoma they will not let kids go to school without them It is dangerous otherwise",
"President Obama spoke in favor of vaccination for children Fox will start telling its viewers to choose against vaccination in 321"]
labels=[0,1,2,0]
dummy = pd.DataFrame({"quote": quotes, "label":labels})
And I want to apply the famous chi-square test to eliminate irrelevant words per category (0, 1, 2), where 0: neutral, 1: positive, 2: negative.
Below is my approach (similar to the approach implemented here)
Briefly, I create a list of 0's with length equal to the corpus length; the 0's represent the first label, y = 0. For the second label (1 = positive) I will create a list of 1's, and similarly for the third label (2 = negative).
After applying this 3 times (once for each target label), I will have three lists with the most dependent words per label. This final list will be my new vocabulary for the TF-IDF vectorizer.
def tweeter_tokenizer(tweet):
    return tweet.split(' ')

english_stopwords = stopwords.words('english')  # assumed from the nltk import above
vectorizer = TfidfVectorizer(tokenizer=tweeter_tokenizer, ngram_range=(1,2), stop_words=english_stopwords)
vectorizer.fit(dummy["quote"])
X_train = vectorizer.transform(dummy["quote"])
y_train = dummy["label"]
feature_names = vectorizer.get_feature_names_out()
y_neutral = np.array([0]*X_train.shape[0])
pValue = 0.90
chi_neutral, p_neutral = chi2(X_train, y_neutral)
chi_neutral
The chi_neutral object is:
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan])
At the end I want to create a dataframe with one row per unique token (feature_names) per label, and I will keep only the words with score > pValue. The dataframe will show me how many of the corpus tokens are dependent on class 0 (neutral). The same approach will be followed for the rest of the labels (1: positive, 2: negative).
y_df = np.array([0]*X_train.shape[1])
tokens_neutral_dependent = pd.DataFrame({
"tweet_token": feature_names,
"chi2_score" : 1-p_neutral,
"neutral_label": y_df #length = length of feature_names()
})
tokens_neutral_dependent = tokens_neutral_dependent.sort_values(["neutral_label","chi2_score"], ascending=[True,False])
tokens_neutral_dependent = tokens_neutral_dependent[tokens_neutral_dependent["chi2_score"]>pValue]
tokens_neutral_dependent.shape
I don't think it's really meaningful to compute the chi-squared statistic without having the classes attached. The code chi2(X_train, y_neutral) is asking "Assuming that class and the parameter are independent, what are the odds of getting this distribution?" But all of the examples you're showing it are the same class.
I would suggest this instead:
chi_neutral, p_neutral = chi2(X_train, y_train)
If you're interested in chi-square statistics between particular classes, you can filter the dataset first to just two classes, then run the chi-squared test. But this step is not necessary.
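For example, a minimal sketch for comparing only the neutral and negative classes, assuming the X_train, y_train and chi2 names from the question:
# keep only the rows labelled 0 (neutral) or 2 (negative), then run the test
mask = y_train.isin([0, 2]).to_numpy()
chi_pair, p_pair = chi2(X_train[mask], y_train[mask])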
I have some code that works, but I've been looking for ways to optimize it and make it faster. When I run this code it can take up to 2 hours to finish on my computer. I'm working with two DataFrames: a big one with over 1 million rows of car parts (SKU numbers, prices, etc.), and, as you can see in the code example, one read from a CSV sheet that has over 300k rows.
This is my code:
npw_inventory = pd.read_csv('inventory/npw.csv', low_memory=False)
count = 0
for _, row in pw.iterrows():
    if row['CS-SKU-NP'][5:] in npw_inventory['SKU'].values:
        matched_npw = npw_inventory.loc[npw_inventory['SKU'] == row['CS-SKU-NP'][5:]]
        min_qty = matched_npw['Min_order_Qty']
        core = matched_npw['Core_cost']
        row['MinPrice'] = (row['MinPrice'] + core) * min_qty
        count += 1
print(count)
What I'm trying to answer is: 1) Is this the most efficient way to do this? And 2) How can I cut down the time it takes to run this code?
Data:
pw
{'WD': 'A', 'LC': 'WGL', 'MasterLC': 307.0, 'Part Number': 'H6545BL',
 'MasterSKU': '307|H6545BL', 'Weight': 2.0, 'DimWeight': 2.0,
 'ShipWeight': 2.0, 'Length': 7.0, 'Width': 5.0, 'Height': 5.0,
 'Total': 27.0, 'DIM': 175.0, 'UPC': '042723936655', 'Qty': 0,
 'CS-SKU': 'AWGL|H6545BL', 'CS-SKU-NP': 'AWGL|H6545BL',
 'MinPrice': 11.53, 'Shipping': nan, 'Carrier': nan, 'Service': nan,
 'Markup': nan, 'ShipMkup': nan, 'ListMkup': nan, 'PackQty': 1.0,
 'MinQty': nan, 'MaxQty': nan, 'Zip Code': 90746.0, 'CatSKU': nan,
 'OP-Lowest(Y)': nan, 'VND-Lowest(Y)': nan, 'MinMkDown': nan,
 'MaxMkUp': nan, 'Interval': nan, 'BundleSKU': nan, 'Duplicate': nan}
npw.csv
{'Line_code': '3MM', 'SKU': '00379', 'SKU_noDS': '00379',
 'Stock_1': 0, 'Stock_11': 0, 'Stock_12': 0, 'Stock_13': 0, 'Stock_15': 25,
 'Stock_16': 0, 'Stock_18': 0, 'Stock_19': 0, 'Stock_2': 0, 'Stock_3': 0,
 'Stock_4': 0, 'Stock_7': 0, 'Stock_89': 0, 'Stock_9': 0, 'Stock_VIC-94': 0,
 'Cost': 90.36, 'Core_cost': 0.0, 'Min_order_Qty': 1}
Basically, I'm running a script in a Jupyter notebook. It loads the pw file and makes various changes to pricing and shipping. For this specific part, I match SKUs in pw with SKUs in npw_inventory to get Core_cost and Min_order_Qty. In pw I add core to the MinPrice of the matched SKU and multiply it by min_qty.
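As a hedged sketch, the same lookup can be done with a single vectorized merge instead of iterrows, assuming pw is already loaded, the column names match the samples above, and SKU is unique in npw.csv (SKU_key is a hypothetical helper column):
# build the join key once, then left-join the inventory columns onto pw
pw['SKU_key'] = pw['CS-SKU-NP'].str[5:]
merged = pw.merge(npw_inventory[['SKU', 'Min_order_Qty', 'Core_cost']],
                  left_on='SKU_key', right_on='SKU', how='left')

# recalculate MinPrice only for the rows that actually found a match
matched = merged['SKU'].notna()
merged.loc[matched, 'MinPrice'] = (
    (merged.loc[matched, 'MinPrice'] + merged.loc[matched, 'Core_cost'])
    * merged.loc[matched, 'Min_order_Qty']
)
print(matched.sum())  # analogous to the count variable above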
I am trying to find a substring in the hard_skills_name column below; I want all rows that have 'Apple Products' as a hard skill.
I tried the code below:
df.loc[df['hard_skills_name'].str.contains("Apple Products", case=False)]
but I am getting this error:
KeyError Traceback (most recent call last)
<ipython-input-49-acdcdfbdfd3d> in <module>
----> 1 df.loc[df['hard_skills_name'].str.contains("Apple Products", case=False)]
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
877
878 maybe_callable = com.apply_if_callable(key, self.obj)
--> 879 return self._getitem_axis(maybe_callable, axis=axis)
880
881 def _is_scalar_access(self, key: Tuple):
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1097 raise ValueError("Cannot index with multidimensional key")
1098
-> 1099 return self._getitem_iterable(key, axis=axis)
1100
1101 # nested tuple slicing
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
1035
1036 # A collection of keys
-> 1037 keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
1038 return self.obj._reindex_with_indexers(
1039 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1252 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1253
-> 1254 self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
1255 return keyarr, indexer
1256
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1296 if missing == len(indexer):
1297 axis_name = self.obj._get_axis_name(axis)
-> 1298 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
1299
1300 # We (temporarily) allow for some missing keys with .loc, except in
KeyError: "None of [Float64Index([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n nan, nan, nan, nan, nan, nan, nan, nan, nan],\n dtype='float64')] are in the [index]"
Try chaining a (temporary) conversion of the list of strings to a comma-separated string with str.join() before the string search:
df[df['hard_skills_name'].str.join(', ').str.contains("Apple Products", case=False)]
The problem is that the string you are searching for is contained within a list. You cannot search for a string inside a list directly with .str.contains(). To solve it, convert the list of strings to one long string first (e.g. with commas separating the substrings) using .str.join() before doing the string search.
Your index has null values. You're going to have to make a boolean mask for this. Directly answering your question:
df.loc[(df.index.notnull()) & (df['hard_skills_name'].str.contains("Apple Products", case=False))]
This should exclude anything that has a null index value and keep the rows whose hard_skills_name contains the given string.
However, I suspect that this will also exclude some data that you're looking for. The solution in that case would be to change your index to not have any NaNs. Whether that means replacing it with a placeholder value or creating a brand new index, that's up to you.
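If you go the brand-new-index route, a minimal sketch (assuming the original index values are not needed and hard_skills_name holds plain strings) is:
# drop the NaN-laden index, replace it with a fresh RangeIndex, then filter as before
df = df.reset_index(drop=True)
df.loc[df['hard_skills_name'].str.contains("Apple Products", case=False)]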
I have a numeric list with NaN values and I want to apply mathematical functions to it. I also need the NaN values to still be there after the computation.
list_a = [1827.07, 1376.21, nan, nan, 1001.88, 978.07]
recal_list = []
for i in list_a:
    time = round(i/55)
    recal_list.append(time)
You could use a pandas Series
from pandas import Series
from numpy import nan
list_a = [1827.07, 1376.21, nan, nan, 1001.88, 978.07]
result = round(Series(list_a) / 55)
print(result.tolist()) # [33.0, 25.0, nan, nan, 18.0, 18.0]
Or your solution, with an if
from numpy import nan, isnan
list_a = [1827.07, 1376.21, nan, nan, 1001.88, 978.07]
recal_list = []
for val in list_a:
    recal_list.append(val if isnan(val) else round(val / 55))
print(recal_list)  # [33, 25, nan, nan, 18, 18]