Pandas df.loc float-comparison condition never works - python

df[['gc_lat', 'gc_lng']] = df[['gc_lat', 'gc_lng']].apply(pd.to_numeric, errors='ignore')
df_realty[['lat', 'lng']] = df_realty[['lat', 'lng']].apply(pd.to_numeric, errors='ignore')
for index, row in df.iterrows():
    gc_lat = float(df.get_value(index, 'gc_lat'))
    gc_lng = float(df.get_value(index, 'gc_lng'))
    latmax = gc_lat + 1/110.574*radius_km
    latmin = gc_lat - 1/110.574*radius_km
    longmax = gc_lng + 1/111.320*radius_km*cos(df.get_value(index, 'gc_lat'))
    longmin = gc_lng - 1/111.320*radius_km*cos(df.get_value(index, 'gc_lat'))
    print(latmax, latmin, longmax, longmin)
    print(gc_lat)
    print(gc_lng)
    print(df_realty.shape)
    subset = df_realty.loc[(df_realty['lat'] < latmax) & (df_realty['lat'] > latmin) & (df_realty['lng'] > longmin) & (df_realty['lng'] < longmax)]
    print(subset.shape)
    print('subset selected!')
This prints:
59.12412758664786 59.03369041335215 37.88659685779323 37.960157142206775
59.078909
37.923377
(290584, 3)
(0, 3)
subset selected!
So I am trying to split the DataFrame into subsets, but the condition I put in df.loc never matches anything!
The data in df_realty is OK; I have already tested it.
It seems like I need some explicit type casts, but I've already done one (pd.to_numeric).
Any suggestions?

Found a solution
The problem was that longmax sometimes became smaller than longmin, because cos sometimes returns a negative float.
Putting abs() around the cosine solved the problem.
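For reference, a minimal sketch of the fixed bounds (assuming the same gc_lng, gc_lat and radius_km as above):
from math import cos, radians

# abs() keeps longmax above longmin even when the cosine comes back negative
longmax = gc_lng + 1/111.320*radius_km*abs(cos(gc_lat))
longmin = gc_lng - 1/111.320*radius_km*abs(cos(gc_lat))
# Note: math.cos expects radians, so a geographically accurate window
# would convert a latitude given in degrees first: abs(cos(radians(gc_lat)))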

Related

Dividing each column in a pandas df by a value from another df

I have a dataframe of size (44, 44) and another of size (44,).
I need to divide each item in a column 'EOFx' by the number in column 'PCx'
(e.g. all values in 'EOF1' by 'PC1').
I've been trying string and numeric loops, but nothing seems to work at all (an error) or I get NaNs.
The last thing I tried was
for k in eof_df.keys():
    for m in pc_df.keys():
        eof_df[k].divide(pc_df[m])
The end result should be a modified eof_df.
What did work for one column outside the loop is this:
eof_df.iloc[:,0].divide(std_df.iloc[0]).head()
Thank you!
Update 1, in response to MoRe:
for eof_df it will be:
{'EOF1': {'8410140.nc': -0.09481700372712784,
'8418150.nc': -0.11842440098461708,
'8443970.nc': -0.1275311990493338,
'8447930.nc': -0.1321116945944401,
'8449130.nc': -0.11649753033608201,
'8452660.nc': -0.14776686151828214,
'8454000.nc': -0.1451132595405897,
'8461490.nc': -0.17032364516557338,
'8467150.nc': -0.20725618455428937,
'8518750.nc': -0.2249648853806308},
'EOF2': {'8410140.nc': 0.051213689088367806,
'8418150.nc': 0.0858110390036938,
'8443970.nc': 0.09029173023479754,
'8447930.nc': 0.05526955432871537,
'8449130.nc': 0.05136680082838883,
'8452660.nc': 0.06105351220962777,
'8454000.nc': 0.052112043784544135,
'8461490.nc': 0.08652511173850089,
'8467150.nc': 0.1137754089944319,
'8518750.nc': 0.10461193696203},
and it goes to EOF44.
For pc_df it will be
{'PC1': 0.5734671652560537,
'PC2': 0.29256502033278076,
'PC3': 0.23586098119374838,
'PC4': 0.227069130368915,
'PC5': 0.1642170373016029,
'PC6': 0.14131097046499339,
'PC7': 0.09837935104899741,
'PC8': 0.0869056762311067,
'PC9': 0.08183389338415169,
'PC10': 0.07467191608481094}
NumPy broadcasting does the division in one step: eof_df.values has shape (44, 44) and pc_df.values has shape (44,), so each column is divided by the value at the matching position (EOF1 by PC1, and so on):
output = pd.DataFrame(index=eof_df.index, data=eof_df.values / pc_df.values)
output.columns = eof_df.columns
A variant that transposes, divides, transposes back, and renames the columns:
data = pd.DataFrame(eof_df.values.T / pc_df.values.T).T
data.columns = ["divided" + str(i + 1) for i in data.columns.to_list()]
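A pandas-native alternative, sketched under the assumption that eof_df's columns and pc_df's index are in matching order (EOF1 with PC1, and so on): DataFrame.div with axis='columns' broadcasts the 44 values across the 44 columns:
eof_df_divided = eof_df.div(pc_df.values, axis='columns')
Passing .values sidesteps label alignment, since the column labels ('EOF1') and the Series index labels ('PC1') differ.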

How to add a new column to a pandas dataframe while iterating over the rows?

I want to generate a new column using some columns that already exist, but I think it is too difficult to use an apply function. Can I generate a new column (ftp_price here) while iterating through this dataframe? Here is my code. When I call product_df['ftp_price'], I get a KeyError.
for index, row in product_df.iterrows():
    current_curve_type_df = curve_df[curve_df['curve_surrogate_key'] == row['curve_surrogate_key_x']]
    min_tmp_df = row['start_date'] - current_curve_type_df['datab_map'].apply(parse)
    min_tmp_df = min_tmp_df[min_tmp_df > timedelta(days=0)]
    curve = current_curve_type_df.loc[min_tmp_df.idxmin()]
    tmp_diff = row['end_time'] - np.array(row['start_time'])
    if np.isin(0, tmp_diff):
        idx = np.where(tmp_diff == 0)
        col_name = COL_NAMES[idx[0][0]]
        row['ftp_price'] = curve[col_name]
    else:
        idx = np.argmin(tmp_diff > 0)
        p_plus_one_rate = curve[COL_NAMES[idx]]
        p_minus_one_rate = curve[COL_NAMES[idx - 1]]
        d_plus_one_days = row['start_date'] + rate_mapping_dict[COL_NAMES[idx]]
        d_minus_one_days = row['start_date'] + rate_mapping_dict[COL_NAMES[idx - 1]]
        row['ftp_price'] = p_minus_one_rate + (p_plus_one_rate - p_minus_one_rate) * (row['start_date'] - d_minus_one_days) / (d_plus_one_days - d_minus_one_days)
An alternative to setting a new value at a particular index is using at:
for index, row in product_df.iterrows():
    product_df.at[index, 'ftp_price'] = val
Also, you should read why using iterrows should be avoided
A row can be a view or a copy (and is often a copy), so changing it will not change the original dataframe. The correct way is to always change the original dataframe using loc or iloc:
product_df.loc[index, 'ftp_price'] = ...
That being said, you should try to avoid explicitly iterating over the rows of a dataframe when possible.
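A toy demonstration of the difference, with a hypothetical two-row frame:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [0, 0]})
for index, row in df.iterrows():
    row['b'] = 99           # mutates the temporary Series only
print(df['b'].tolist())     # [0, 0] -- the dataframe is unchanged

for index, row in df.iterrows():
    df.at[index, 'b'] = 99  # writes into the dataframe itself
print(df['b'].tolist())     # [99, 99]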

Python - ValueError: could not broadcast input array from shape (5) into shape (2)

I have written some code which takes in my dataframe, which consists of two columns: one is a string and the other is an idea count. The code takes in the dataframe, tries several delimiters, and cross-references the result with the count to check it is using the correct one. The result I am looking for is to add a new column called "Ideas" which contains the list of broken-out ideas. My code is below:
def getIdeas(row):
    s = str(row[0])
    ic = row[1]
    # Try to break on delimiters like ";;"
    my_dels = [";;", ";", ",", "\\", "//"]
    for d in my_dels:
        ideas = s.split(d)
        if len(ideas) == ic:
            return ideas
    # Try to break on numbers "N)"
    ideas = re.split(r'[0-9]\)', s)
    if len(ideas) == ic:
        return ideas
    ideas = []
    return ideas
# k = getIdeas(str_contents3, idea_count3)
xl = pd.ExcelFile("data/Total Dataset.xlsx")
df = xl.parse("Sheet3")
df1 = df.iloc[:,1:3]
df1 = df1.loc[df1.iloc[:,1] != 0]
df1["Ideas"] = df1.apply(getIdeas, axis=1)
When I run this I am getting an error
ValueError: could not broadcast input array from shape (5) into shape (2)
Could someone tell me how to fix this?
You have two options with apply and axis=1: either you return a single value, or a list whose length matches your number of columns. If you match the number of columns, the list will be broadcast to the entire row; if you return a single value, apply will return a pandas Series.
One workaround would be to not use apply:
result = []
for idx, row in df1.iterrows():
    result.append(getIdeas(row))
df1['Ideas'] = result
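The same workaround also fits in a list comprehension:
df1['Ideas'] = [getIdeas(row) for _, row in df1.iterrows()]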

pandas fillna is not working on subset of the dataset

I want to impute the missing values for df['box_office_revenue'] with the median specified by df['release_date'] == x and df['genre'] == y.
Here is my median finder function below.
def find_median(df, year, genre, col_year, col_rev):
    median = df[(df[col_year] == year) & (df[col_rev].notnull()) & (df[genre] > 0)][col_rev].median()
    return median
The median function works; I checked. I added the line below since I was getting a SettingWithCopy warning.
pd.options.mode.chained_assignment = None # default='warn'
I then go through the years and genres, with col_name = ['is_drama', 'is_horror', etc.].
i = df['release_year'].min()
while (i < df['release_year'].max()):
    for genre in col_name:
        median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
        df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(median, inplace=True)
    print(i)
    i += 1
However, nothing changed!
len(df['box_office_revenue'].isnull())
The output was 35527, meaning none of the null values in df['box_office_revenue'] had been filled.
Where did I go wrong?
Here is a quick look at the data: the other columns are just binary variables.
You mentioned
I did the code below since I was getting some CopyValue error...
The warning is important. You did not give your data, so I cannot actually check, but the problem is likely due to:
df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(..)
Let's break this down:
First you select some rows with:
df[(df['release_year'] == i) & (df[genre] > 0)]
Then from that, you select a column with:
...['box_office_revenue']
And now you have a problem...
Why?
The problem is that when you selected some rows (i.e., not all), pandas was forced to create a copy of your dataframe. You then selected a column of the copy, and called fillna() on the copy. Not super useful.
How do I fix it?
Select the column first:
df['box_office_revenue'][(df['release_year'] == i) & (df[genre] > 0)].fillna(..)
By selecting the entire column first, pandas is not forced to make a copy, and thus subsequent operations should work as desired.
This is not elegant, but I think it works. Basically, I calculate the means conditioned on genre and year, and then join the data to a dataframe containing the imputing values. Then, wherever the revenue data is null, I replace the null with the imputed value.
import pandas as pd
import numpy as np

# Fake data
rev = np.random.normal(size=10_000, loc=20)
rev_ix = np.random.choice(range(rev.size), size=100)
rev[rev_ix] = np.NaN
year = np.random.choice(range(1950, 2018), replace=True, size=10_000)
genre = np.random.choice(list('abc'), size=10_000, replace=True)
df = pd.DataFrame({'rev': rev, 'year': year, 'genre': genre})

imputing_vals = df.groupby(['year', 'genre']).mean()
s = df.set_index(['year', 'genre'])
s.rev.isnull().any()  # True

# Creates a dataframe with a new column containing the group means
s = s.join(imputing_vals, rsuffix='_R')
s.loc[s.rev.isnull(), 'rev'] = s.loc[s.rev.isnull(), 'rev_R']
new_df = s['rev'].reset_index()
new_df.rev.isnull().any()  # False
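A quick way to see why the join lines up: both frames share the ('year', 'genre') MultiIndex, so each row receives its own group's mean as rev_R.
print(imputing_vals.index.names)  # ['year', 'genre'] -- same index as s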
This page describing chained assignment seems useful for such a case: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#evaluation-order-matters
As explained there, instead of doing (in your for loop):
for genre in col_name:
    median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
    df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(median, inplace=True)
You can try:
for genre in col_name:
    median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
    df.loc[(df['release_year'] == i) & (df[genre] > 0) & (df['box_office_revenue'].isnull()), 'box_office_revenue'] = median
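If the genre lived in a single column (as in the fake-data example above) instead of one-hot is_* columns, a groupby/transform one-liner would be a common alternative. A sketch under that assumption, using the median to match the question:
df['rev'] = df['rev'].fillna(df.groupby(['year', 'genre'])['rev'].transform('median'))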

Python: xlrd discerning dates from floats

I wanted to import a file containing text, numbers and dates using xlrd in Python.
I tried something like:
if "/" in worksheet.cell_value:
    do_this
else:
    do_that
But that was of no use, as I later discovered that dates are stored as floats, not strings. To convert them to the datetime type I did:
try:
    get_row = str(datetime.datetime(*xlrd.xldate_as_tuple(worksheet.cell_value(i, col - 1), workbook.datemode)))
except:
    get_row = unicode(worksheet.cell_value(i, col - 1))
I have an exception in place for when the cell contains text. Now I want to get the numbers as numbers and the dates as dates, because right now all numbers are converted to dates.
Any ideas?
I think you could make this much simpler by making more use of the tools available in xlrd:
cell_type = worksheet.cell_type(row - 1, i)
cell_value = worksheet.cell_value(row - 1, i)
if cell_type == xlrd.XL_CELL_DATE:
    # Returns a tuple.
    dt_tuple = xlrd.xldate_as_tuple(cell_value, workbook.datemode)
    # Create datetime object from this tuple.
    get_col = datetime.datetime(
        dt_tuple[0], dt_tuple[1], dt_tuple[2],
        dt_tuple[3], dt_tuple[4], dt_tuple[5]
    )
elif cell_type == xlrd.XL_CELL_NUMBER:
    get_col = int(cell_value)
else:
    get_col = unicode(cell_value)
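Since xldate_as_tuple returns exactly the six fields that datetime.datetime expects (year, month, day, hour, minute, second), the construction can be shortened with argument unpacking:
get_col = datetime.datetime(*dt_tuple)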
Well, never mind, I found a solution and here it is!
try:
    cell = worksheet.cell(row - 1, i)
    if cell.ctype == xlrd.XL_CELL_DATE:
        date = datetime.datetime(1899, 12, 30)
        get_ = datetime.timedelta(int(worksheet.cell_value(row - 1, i)))
        get_col2 = str(date + get_)[:10]
        d = datetime.datetime.strptime(get_col2, '%Y-%m-%d')
        get_col = d.strftime('%d-%m-%Y')
    else:
        get_col = unicode(int(worksheet.cell_value(row - 1, i)))
except:
    get_col = unicode(worksheet.cell_value(row - 1, i))
A bit of explanation: it turns out that with xlrd you can actually check the type of a cell and see whether it is a date or not. Also, Excel has a peculiar way of saving datetimes: it stores them as floats (the integer part counting days, the fractional part the time of day) and reconstructs the date by adding that float to a specific epoch (1899-12-30 seems to work OK). So, to create the date I wanted, I added them and kept only the first 10 characters ([:10]) to drop the time part (00:00:00 or so). I also switched to day-month-year order, because in Greece we use a different order. Finally, this code checks whether a number can be converted to an integer (I don't want any floats showing up in my program), and if everything fails it just uses the cell as it is (for cells containing strings).
I hope you find this useful; I think there are other threads saying that this is impossible.
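A quick sanity check of that epoch arithmetic (assuming the default 1900 date system, in which serial 43831 is 2020-01-01):
import datetime
print(datetime.datetime(1899, 12, 30) + datetime.timedelta(days=43831))
# 2020-01-01 00:00:00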
