combining multiple columns into one by checking for NaN - python

I'm struggling with the following case. I have a dataframe whose columns contain either NaN or a string value. How could I merge all 3 columns (i.e. Q3_4_4, Q3_4_5, and Q3_4_6) into one new column (i.e. Q3_4) that keeps only the string value? This column would have 'hi' in row 1, 'bye' in row 2, and 'hello' in row 3.
Thank you for your help
{'Q3_4_4': {'R_00RfS8OP6QrTNtL': nan,
            'R_3JtmbtdPjxXZAwA': nan,
            'R_3G2sp6TEXZmf2KI': 'hello'},
 'Q3_4_5': {'R_00RfS8OP6QrTNtL': 'hi',
            'R_3JtmbtdPjxXZAwA': nan,
            'R_3G2sp6TEXZmf2KI': nan},
 'Q3_4_6': {'R_00RfS8OP6QrTNtL': nan,
            'R_3JtmbtdPjxXZAwA': 'bye',
            'R_3G2sp6TEXZmf2KI': nan}}

If you need to join columns whose names match after dropping the part after the last _, use GroupBy.first with axis=1 (i.e. group per column) together with a lambda that splits each column name from the right on the last _:
import pandas as pd
import numpy as np

nan = np.nan
df = pd.DataFrame({'Q3_4_4': {'R_00RfS8OP6QrTNtL': nan,
                              'R_3JtmbtdPjxXZAwA': nan,
                              'R_3G2sp6TEXZmf2KI': 'hello'},
                   'Q3_4_5': {'R_00RfS8OP6QrTNtL': 'hi',
                              'R_3JtmbtdPjxXZAwA': nan,
                              'R_3G2sp6TEXZmf2KI': nan},
                   'Q3_4_6': {'R_00RfS8OP6QrTNtL': nan,
                              'R_3JtmbtdPjxXZAwA': 'bye',
                              'R_3G2sp6TEXZmf2KI': nan}})
# Group columns by their name trimmed at the last '_', then take the
# first non-NaN value per group, row-wise
df = df.groupby(lambda x: x.rsplit('_', 1)[0], axis=1).first()
print(df)
Q3_4
R_00RfS8OP6QrTNtL hi
R_3JtmbtdPjxXZAwA bye
R_3G2sp6TEXZmf2KI hello
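Note that groupby(..., axis=1) is deprecated in newer pandas versions (2.1+). A minimal equivalent sketch, starting from the original df: transpose, group on the trimmed index labels, and transpose back:

# Same result without the deprecated axis=1 argument
df = df.T.groupby(lambda x: x.rsplit('_', 1)[0]).first().T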

Another option is to take the last non-NaN value in each row:
df.apply(lambda x: x[x.last_valid_index()], axis=1)
More methods: https://stackoverflow.com/a/46520070/8170215

Related

How can I define NA values in a list and check pandas dataframe?

In my dataframe, a value should be classed as missing if it is "nan", 0 or "missing".
I wrote a list
null_values = ["nan", 0, "missing"]
Then I checked which columns contained missing values
df.columns[df.isin(null_values).any()]
My df looks as follows:
{'Sick': {0: False, 1: False, 2: False, 3: False, 4: False, 5: False, 6: True},
'Name': {0: 'Liam',
1: 'Chloe',
2: 'Andy',
3: 'Max',
4: 'Anne',
5: nan,
6: 'Charlie'},
'Age': {0: 1.0, 1: 2.0, 2: 8.0, 3: 4.0, 4: 5.0, 5: 2.0, 6: 9.0}}
It flags the column 'Sick' as containing missing values even though it only contains FALSE/TRUE values. However, it correctly recognises that Age has no missing values. Why does it count FALSE as a missing value when I have not defined it as such in my list?
One idea is to exclude boolean columns with DataFrame.select_dtypes:
null_values = ["nan", 0, "missing"]
df1 = df.select_dtypes(exclude=bool)
print (df1.columns[df1.isin(null_values).any()])
Index([], dtype='object')
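The root cause of the surprise is that Python's bool is a subclass of int, so False == 0 evaluates to True, and isin with 0 in the list therefore matches False. A quick demonstration:

import pandas as pd

print(False == 0)                          # True: bool is a subclass of int
print(pd.Series([True, False]).isin([0]))  # the False element matches 0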
If you also need to match NaN missing values, add np.nan to the list:
null_values = ["nan", 0, "missing", np.nan]
df1 = df.select_dtypes(exclude=bool)
print (df1.columns[df1.isin(null_values).any()])
Index(['Name'], dtype='object')
EDIT: Another trick is to convert all values to strings; you then need to compare against '0' for strings made from integers and '0.0' for strings made from floats:
null_values = ["nan", '0', '0.0', "missing"]
print (df.columns[df.astype(str).isin(null_values).any()])
Index(['Name'], dtype='object')

Extract strings from a Dataframe looping over a single row

I'm reading multiple PDFs (using tabula) into data frames like this:
nan = float('nan')
DataFrame_as_dict = {'Unnamed: 0': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'Unnamed: 1': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'NOTA DE NEGOCIAÇÃO': {0: nan,
1: nan,
2: 'Rico Investimentos - Grupo XP',
3: 'Av. Presidente Juscelino Kubitschek - Torre Sul, 1909 - 25o ANDAR VILA OLIMPIA 4543-907',
4: 'Tel. 3003-5465Fax: (55 11) 4007-2465',
5: 'Internet: www.rico.com.vc SAC: 0800-774-0402 e-mail: atendimento#rico.com.vc'},
'Unnamed: 3': {0: 'Nr. nota Folha',
1: '330736 1',
2: nan,
3: 'SÃO PAULO - SP',
4: nan,
5: nan},
'Unnamed: 4': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'Unnamed: 5': {0: 'Data pregão',
1: '30/09/2015',
2: nan,
3: nan,
4: nan,
5: nan}}
df = pd.DataFrame(DataFrame_as_dict)
(figure: the dataframe rendered from the dict above)
My intention is to store the value '330736 1' in a variable "number" and '30/09/2015' in a variable "date".
The issue is that, although these values will always be located in row 1, the columns vary in an unpredictable way across the multiple PDFs.
Therefore, I tried to loop over the columns of row 1 in order to extract these values regardless of which columns they end up in:
list_columns = df.columns
for i in range(len(list_columns)):
    if isinstance(df.iloc[1:2,i], str):
        if df.iloc[1:2,i].str.contains("/",na=False,regex=False).any():
            date = str(df.iloc[1:2,i]).strip()
        else:
            n_nota = str(df.iloc[1:2,i]).strip()
However, without success... Any thoughts?
In your original code, if isinstance(df.iloc[1:2,i], str) will never evaluate to True for two reasons:
Strings inside DataFrames are of type object
df.iloc[1:2,i] will always be a pandas Series.
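A quick check of that second point, assuming the df built above:

print(type(df.iloc[1:2, 3]))  # <class 'pandas.core.series.Series'> (a slice)
print(type(df.iloc[1, 3]))    # <class 'str'> (a scalar lookup)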
Since object is such a flexible type, it's not as useful as str for identifying the data you want. In the code below, I simply used a space character to differentiate the data you want for n_nota. If this doesn't work with your data, a regex pattern may be a good approach.
list_columns = df.columns
for i in range(len(list_columns)):
    if isinstance(df.iloc[1:2,i].values, object):
        if "/" in str(df.iloc[1:2,i].values):
            date = str(df.iloc[1:2,i].values[0]).strip()
        elif " " in str(df.iloc[1:2,i].values):
            n_nota = str(df.iloc[1:2,i].values[0]).strip()
Edit: As noted below, the original code in the question text would have worked if each df.iloc[1:2,i] were replaced with df.iloc[1,i] as in:
list_columns = df.columns
for i in range(len(list_columns)):
    if isinstance(df.iloc[1,i], str):
        if "/" in df.iloc[1,i]:
            date = df.iloc[1,i].strip()
        else:
            n_nota = df.iloc[1,i].strip()
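A loop-free sketch of the same extraction, assuming (as the question states) that the values always sit in row 1 and reusing the same heuristics (a slash marks the date, a space marks the note number): pull the row once as strings and match on it.

row = df.iloc[1].astype(str)  # row 1 as strings; NaN becomes 'nan'
date = row[row.str.contains('/', regex=False)].iloc[0].strip()
n_nota = row[row.str.contains(' ', regex=False)].iloc[0].strip()
print(date)    # 30/09/2015
print(n_nota)  # 330736 1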

Calculating numeric list with string values

I have a numeric list containing NaN values and I want to apply mathematical functions to it, but I need those NaN values to still be there after the computation:
from numpy import nan

list_a = [1827.07, 1376.21, nan, nan, 1001.88, 978.07]
recal_list = []
for i in list_a:
    time = round(i/55)  # fails on NaN: round() raises "cannot convert float NaN to integer"
    recal_list.append(time)
You could use a pandas Series:
from pandas import Series
from numpy import nan
list_a = [1827.07, 1376.21, nan, nan, 1001.88, 978.07]
result = round(Series(list_a) / 55)
print(result.tolist()) # [33.0, 25.0, nan, nan, 18.0, 18.0]
Or your solution, with an if:
from numpy import nan, isnan

list_a = [1827.07, 1376.21, nan, nan, 1001.88, 978.07]
recal_list = []
for val in list_a:
    recal_list.append(val if isnan(val) else round(val / 55))
print(recal_list)  # [33, 25, nan, nan, 18, 18]
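If numpy is acceptable, np.round propagates NaN through the division instead of raising, so the whole computation becomes one line; a sketch:

import numpy as np

list_a = [1827.07, 1376.21, np.nan, np.nan, 1001.88, 978.07]
result = np.round(np.array(list_a) / 55)
print(result.tolist())  # [33.0, 25.0, nan, nan, 18.0, 18.0]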

How to map functions in dask

I'm using Dask to manipulate a dataframe (coming from a CSV file) and I'm looking for a way to improve this code using something like map or apply functions, since on large files it takes so long (I know that nested loops with iterrows() are the worst thing I can do):
NAN_VALUES = [-999, "INVALID", -9999]

_all_rows = list()
rows_count = 0
for index, row in df.iterrows():
    _row = list()
    for key, value in row.iteritems():
        if value in NAN_VALUES or pd.isnull(value):
            _row.append(None)
        else:
            _row.append(apply_transform(key, value))
    _all_rows.append(_row)
    rows_count += 1
How can I map this code using map_partitions or pandas.map?
EXTRA: a bit more context:
In order to be able to apply some functions I'm replacing NaN values with a default value. Finally I need to make a list for each row, replacing the default values with None.
1.- Original DF
"name" "age" "money"
---------------------------
"David" NaN 12.345
"Jhon" 22 NaN
"Charles" 30 123.45
NaN NaN NaN
2.- Passing NaN to Default value
"name" "age" "money"
------------------------------
"David" -999 12.345
"Jhon" 22 -9999
"Charles" 30 123.45
"INVALID" -999 -9999
3.- Parse to a list each row
"name" , "age", "money"
------------------------
["David", None, 12.345]
["Jhon", 22, None]
["Charles", 30, 123.45]
[None, None, None]
My suggestion here is to get it working with pandas first and then translate it into dask.
pandas
import pandas as pd
import numpy as np
nan = np.nan
df = {'name': {0: 'David', 1: 'John', 2: 'Charles', 3: nan},
'age': {0: nan, 1: 22.0, 2: 30.0, 3: nan},
'money': {0: 12.345, 1: nan, 2: 123.45, 3: nan}}
df = pd.DataFrame(df)
# These are your default values
diz = {"age": -999, "name": "INVALID", "money": -9999}
Passing NaN to Default value
for k, v in diz.items():
    df[k] = df[k].fillna(v)
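As a side note, fillna also accepts a dict mapping column names to fill values, so the loop can collapse to a single call (the same holds for the dask version below):

df = df.fillna(diz)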
Get a list for every row
df.apply(list, axis=1)
0 [David, nan, 12.345]
1 [John, 22.0, nan]
2 [Charles, 30.0, 123.45]
3 [nan, nan, nan]
dtype: object
dask
import pandas as pd
import dask.dataframe as dd
import numpy as np
nan = np.nan
df = {'name': {0: 'David', 1: 'John', 2: 'Charles', 3: nan},
'age': {0: nan, 1: 22.0, 2: 30.0, 3: nan},
'money': {0: 12.345, 1: nan, 2: 123.45, 3: nan}}
df = pd.DataFrame(df)
# These are your default values
diz = {"age": -999, "name": "INVALID", "money": -9999}
# transform to dask dataframe
df = dd.from_pandas(df, npartitions=2)
Passing NaN to Default value
This is exactly the same as before. Note that since dask is lazy, you should run df.compute() if you want to see the effects.
for k, v in diz.items():
    df[k] = df[k].fillna(v)
Get a list for every row
Here things change a bit, as you are required to state the dtype of your output explicitly:
df.apply(list, axis=1, meta=(None, 'object'))
Alternatively, in dask you can use map_partitions as follows:
df.map_partitions(lambda x: x.apply(list, axis=1))
Remark: please consider that if your data fits in memory you don't need dask, and pandas could be faster.
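Tying this back to the question's final step (mapping the default values to None), both the sentinel replacement and the None mapping can be done vectorized. A sketch in pandas, assuming the NAN_VALUES list from the question:

import pandas as pd
import numpy as np

NAN_VALUES = [-999, "INVALID", -9999]
nan = np.nan
df = pd.DataFrame({'name': {0: 'David', 1: 'John', 2: 'Charles', 3: nan},
                   'age': {0: nan, 1: 22.0, 2: 30.0, 3: nan},
                   'money': {0: 12.345, 1: nan, 2: 123.45, 3: nan}})

# Treat sentinel values as missing, then turn every missing value into None
cleaned = df.mask(df.isin(NAN_VALUES))
rows = cleaned.astype(object).where(cleaned.notna(), None).apply(list, axis=1)
print(rows.tolist())  # [['David', None, 12.345], ['John', 22.0, None], ...]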

Remove nan values from a dict in python

I am trying to remove keys with nan values from a dictionary formed from pandas using Python. Is there a way I can achieve this?
Here is a sample of my dictionary:
{'id': 1, 'internal_id': '1904', 'first_scraping_time': '2020-04-17 12:44:59.0', 'first_scraping_date': '2020-04-17', 'last_scraping_time': '2020-06-20 03:08:47.0', 'last_scraping_date': '2020-06-20', 'is_active': 1,'flags': nan, 'phone': nan,'size': 60.0, 'available': '20-06-2020', 'timeframe': nan, 'teaser': nan, 'remarks': nan, 'rent': 4984.0, 'rooms': '3', 'downpayment': nan, 'deposit': '14952', 'expenses': 600.0, 'expenses_tv': nan, 'expenses_improvements': nan, 'expenses_misc': nan, 'prepaid_rent': '4984', 'pets': nan, 'furnished': nan, 'residence_duty': nan, 'precision': nan, 'nearby_cities': nan,'type_dwelling': nan, 'type_tenants': nan, 'task_id': '614b8fc2-409c-403a-9650-05939e8a89c7'}
Thank you!
nan is a tricky object to work with because it doesn't compare equal to anything, including itself (and two nans don't even necessarily share object identity).
You can use math.isnan to test for it, but guard with isinstance first, since math.isnan raises a TypeError on non-numeric values and this dict mixes strings and numbers:
import math

new = {key: value for (key, value) in old.items()
       if not (isinstance(value, float) and math.isnan(value))}
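Since the dict came out of pandas anyway, an alternative is pd.isna, which handles strings, numbers, and NaN uniformly without the type check:

import pandas as pd

new = {key: value for key, value in old.items() if not pd.isna(value)}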
