I have two Excel files that I converted into dataframes.
DF1: contains columns 'JobKey' and 'Aircraft Numbers' (amongst other data)
DF2: contains columns 'JobKey' and 'Shortage' (amongst other data)
I want to create a column 'Short' in DF1 by mapping in the values from DF2 for the JobKeys present there (effectively a VLOOKUP).
For both I set JobKey as the index:
# Import relevant libraries:
import pandas as pd
import numpy as np
DF1 = pd.read_excel('...')
DF2 = pd.read_excel('...')
DF1['Short'] = " "
DF1.set_index('JobKey', inplace = True)
DF2.set_index('JobKey', inplace = True)
Both OK. I print a sample of both using DF.head() and it looks OK. I want to use .index.map() as was done here:
https://towardsdatascience.com/vlookup-implementation-in-python-in-three-simple-steps-93b5a290fd72
DF1["Short"]=DF1.index.map(DF2["Shortage"])
However I get the error:
---------------------------------------------------------------------------
C:\ProgramData\Anaconda3\lib\site-packages\pandas\indexes\base.py in map(self, mapper)
2439 applied : array
2440 """
-> 2441 return self._arrmap(self.values, mapper)
2442
2443 def isin(self, values, level=None):
pandas\src\algos_common_helper.pxi in pandas.algos.arrmap_object (pandas\algos.c:46681)()
TypeError: 'Series' object is not callable
-------
Any ideas as to why? It seems pretty straightforward, yet I can't find the cause of the problem.
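For reference, the traceback points at an older pandas where Index.map applies its argument as a function, which is why passing a Series raises 'Series' object is not callable. Two sketches that should work regardless of version, assuming both frames are indexed by JobKey:
# pass a callable (the Series' dict-like .get) instead of the Series itself
DF1['Short'] = DF1.index.map(DF2['Shortage'].get)
# or skip Index.map entirely and align on the index
DF1['Short'] = DF2['Shortage'].reindex(DF1.index)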
Related
I'm trying to use DataFrame.map_partitions() from Dask to apply a function to each partition. The function takes as input a list of values and has to return the rows of the dataframe partition that contain these values in a specific column (using .loc[] and .isin()).
The issue is that I get this error:
"index = partition_info['number'] - 1
TypeError: 'NoneType' object is not subscriptable"
When I print partition_info, it prints None hundreds of times (but I only have 60 elements in the loop, so I would expect only 60 prints). Is it normal for it to print None because it's a child process, or am I missing something about partition_info? I cannot find useful information on that.
from typing import List
from dask.dataframe import from_pandas

def apply_f(df, barcodes_per_core: List[List[str]], partition_info=None):
    print(partition_info)
    index = partition_info['number'] - 1
    indexes = barcodes_per_core[index]
    return df.loc[df['barcode'].isin(indexes)]

df = from_pandas(df, npartitions=nb_cores)
dfs_per_core = df.map_partitions(apply_f, barcodes_per_core, meta=df)
dfs_per_core = dfs_per_core.compute(scheduler='processes')
=> Doc of partition_info at the end of this page.
It's not clear why things are not working on your end; one potential issue is that you are re-using df multiple times. Here's a MWE that works:
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame(range(10), columns=["a"])
ddf = dd.from_pandas(df, npartitions=3)
def my_func(d, x, partition_info=None):
    print(x, partition_info)
ddf.map_partitions(my_func, 3, meta=df.head()).compute(scheduler='processes')
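For what it's worth, partition_info is documented to be None when Dask calls the function outside a real partition (for example while inferring meta), so a defensive guard like the sketch below avoids the NoneType error; this is a workaround, not a diagnosis of why it fires hundreds of times:
def apply_f(df, barcodes_per_core, partition_info=None):
    # partition_info may be None when Dask invokes the function to infer
    # metadata rather than on an actual partition; return an empty frame then
    if partition_info is None:
        return df.iloc[0:0]
    index = partition_info['number'] - 1
    return df.loc[df['barcode'].isin(barcodes_per_core[index])]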
I am a complete Python and Pandas novice. I am following a tutorial, and so far have the following code:
import numpy as np
import pandas as pd
import plotly as pyplot
import datetime
df = pd.read_csv("GlobalLandTemperaturesByCountry.csv")
df = df.drop("AverageTemperatureUncertainty", axis=1)
df = df.rename(columns={"dt": "Date"})
df = df.rename(columns={"AverageTemperature": "AvTemp"})
df = df.dropna()
df_countries = df.groupby(["Country", "Date"]).sum().reset_index().sort_values("Date", ascending=False)
start_date = "2001-01-01"
end_date = "2002-01-01"
mask = (df_countries["Date"] > start_date) & (df_countries["Date"] <= end_date)
df_mask = df_countries.loc(mask)
When I try to run the code, I get an error on the last line, i.e. df_mask = df_countries.loc(mask), the error being:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I have already found several StackOverflow answers for this error, but none seem to match my scenario enough to help. Why am I getting this error?
In the above example, df_countries is a DataFrame and mask is a boolean Series meant as a condition to filter that DataFrame.
A Series is mutable, meaning its contents can be changed without reassigning the variable, so its hash value could change at any point; that is why it cannot be hashed. The error appears because .loc was called with parentheses: pandas then treats mask as a function argument (it tries to interpret it as an axis and hash it) instead of an indexer. .loc indexes with square brackets.
Try:
df_mask = df_countries.loc[mask]
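For illustration, a minimal self-contained example of the corrected indexing (the data here is invented for the demo):
import pandas as pd

df = pd.DataFrame({"Date": ["2000-06-01", "2001-06-01", "2002-06-01"],
                   "AvTemp": [9.1, 9.4, 9.2]})
mask = (df["Date"] > "2001-01-01") & (df["Date"] <= "2002-01-01")
print(df.loc[mask])   # square brackets: boolean indexing, keeps the 2001 row
print(df[mask])       # plain [] also accepts a boolean mask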
How can I append a dataframe to another dataframe which is already saved in a file without loading it from the file? (Python 3.6 & Pandas 1.0.1)
Example:
import pandas as pd
data = [[['A01','A02'],'B0','C0'],[['A11','A12'],'B1','C1'],[['A21','A22'],'B2','C2']]
df = pd.DataFrame(data,columns=['A','B','C'])
data2 = [[['A31','A32'],'B3','C3'],[['A41','A42'],'B4','C4'],[['A51','A52'],'B5','C5']]
df2 = pd.DataFrame(data2,columns=['A','B','C'])
print(df.append(df2,ignore_index=True))
#version 1:
store = pd.HDFStore('test.h5','a')
store.append(key='foo',value=df)#, format='t', data_columns=True)
store.append(key='foo',value=df2)#, format='t', data_columns=True, append=True)
#version 2
df.to_hdf(path_or_buf='test.h5',key='foo',mode='w',format='t')
df2.to_hdf(path_or_buf='test.h5',key='foo',mode='a',append=True,format='t',data_columns=True)
#version 3
df.to_hdf(path_or_buf='test.h5',key='foo',mode='w',format='f')
df2.to_hdf(path_or_buf='test.h5',key='foo',mode='a',append=True,format='f',data_columns=True)
df3 = pd.read_hdf('test.h5',key='foo',mode='r')
print(df3)
version 1: TypeError: object of type 'int' has no len()
version 2: TypeError: object of type 'int' has no len()
version 3: ValueError: Can only append to Tables
This question was asked similarly here, but quite a while ago. I tried it with an older pandas version, but that caused even more problems.
EDIT:
It seems that the issue is the arrays as cell content. If I use only the Bs and Cs, like so, it works:
data = [['B0','C0'],['B1','C1'],['B2','C2']]
df = pd.DataFrame(data,columns=['B','C'])
data2 = [['B3','C3'],['B4','C4'],['B5','C5']]
df2 = pd.DataFrame(data2,columns=['B','C'])
Does anybody know how I can get this to work despite using arrays as content?
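One possible workaround, sketched below under the assumption that a JSON round-trip of the list column is acceptable: the HDF5 table format cannot store arbitrary Python objects such as lists, so serialise them to strings before appending and parse them back after reading.
import json
import pandas as pd

data = [[['A01','A02'],'B0','C0'],[['A11','A12'],'B1','C1']]
df = pd.DataFrame(data, columns=['A','B','C'])
data2 = [[['A31','A32'],'B3','C3'],[['A41','A42'],'B4','C4']]
df2 = pd.DataFrame(data2, columns=['A','B','C'])

# lists -> JSON strings so the table format can store the column
df['A'] = df['A'].apply(json.dumps)
df2['A'] = df2['A'].apply(json.dumps)

df.to_hdf('test.h5', key='foo', mode='w', format='t')
df2.to_hdf('test.h5', key='foo', mode='a', append=True, format='t')

df3 = pd.read_hdf('test.h5', key='foo')
df3['A'] = df3['A'].apply(json.loads)  # JSON strings -> lists again
print(df3)
If later rows may carry longer strings than the first write, min_itemsize may need to be set on the initial to_hdf call.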
I am trying to load a statsmodels dataset as I saw on a tutorial, but I keep getting an error.
import statsmodels as sm
import pandas as pd
data = sm.datasets.co2.load_pandas()
co2 = data.data
co2['ds'] = co2.index
co2.rename(columns={'co2': 'y'}, inplace=True)
co2.tail()
This is the error I am getting:
TypeError: new() got an unexpected keyword argument 'format'
It looks like the problem is with the original load_pandas function: the format parameter (along with start and periods) no longer exists on pd.DatetimeIndex in newer pandas versions. For details please refer to https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DatetimeIndex.html
def load_pandas():
    data = load()
    # pandas <= 0.12.0 fails in the to_datetime regex on Python 3
    index = pd.DatetimeIndex(start=data.data['date'][0].decode('utf-8'),
                             periods=len(data.data), format='%Y%m%d',
                             freq='W-SAT')
    dataset = pd.DataFrame(data.data['co2'], index=index, columns=['co2'])
    # NOTE: this is how I got the missing values in co2.csv
    # new_index = pd.DatetimeIndex(start='1958-3-29', end=index[-1],
    #                              freq='W-SAT')
    # data.data = dataset.reindex(new_index)
    data.data = dataset
    return data
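For reference, in current pandas the removed DatetimeIndex(start=..., periods=..., freq=...) construction maps onto pd.date_range; a hypothetical drop-in for the call above (assuming the compact '%Y%m%d' string still parses, since date_range has no format argument):
index = pd.date_range(start=data.data['date'][0].decode('utf-8'),
                      periods=len(data.data), freq='W-SAT')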
So my workaround is below:
Load the data into a pandas DataFrame:
co2 = pd.DataFrame(sm.datasets.co2.load().data)
Convert the bytes into strings and then into datetimes:
co2['date'] = pd.to_datetime(co2.date.apply(lambda x: x.decode("utf-8")))
Set the date as the index:
co2.set_index('date',inplace=True)
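Putting the steps together with the rest of the tutorial code, a sketch assuming load() still returns byte-string dates as above (the asfreq call is my assumption, reinstating the weekly 'W-SAT' frequency the original index declared; statsmodels.api is the conventional import):
import pandas as pd
import statsmodels.api as sm

co2 = pd.DataFrame(sm.datasets.co2.load().data)
co2['date'] = pd.to_datetime(co2['date'].apply(lambda x: x.decode('utf-8')))
co2.set_index('date', inplace=True)
co2 = co2.asfreq('W-SAT')  # assumption: restore the weekly frequency
co2['ds'] = co2.index
co2.rename(columns={'co2': 'y'}, inplace=True)
print(co2.tail())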
I have written a pandas function and it runs fine (the second-to-last line of my code). When I try to assign my function's output to columns in the dataframe, I get the error TypeError: unhashable type: 'list'.
I posted something similar before and I am using the method shown in the answer to that question in the function below. But it still fails :(
import pandas as pd
import numpy as np
def benford_function(value):
    if value == '':
        return []
    if "." in value:
        before_decimal = value.split(".")[0]
        if len(before_decimal) == 0:
            bd_first = "0"
            bd_second = "0"
        if len(before_decimal) > 1:
            before_decimal = before_decimal[:2]
            bd_first = before_decimal[0]
            bd_second = before_decimal[1]
        elif len(before_decimal) == 1:
            bd_first = "0"
            bd_second = before_decimal[0]
        after_decimal = value.split(".")[1]
        if len(after_decimal) > 1:
            ad_first = after_decimal[0]
            ad_second = after_decimal[1]
        elif len(after_decimal) == 1:
            ad_first = after_decimal[0]
            ad_second = "0"
        else:
            ad_first = "0"
            ad_second = "0"
    else:
        ad_first = "0"
        ad_second = "0"
        if len(value) > 1:
            bd_first = value[0]
            bd_second = value[1]
        else:
            bd_first = "0"
            bd_second = value[0]
    return pd.Series([bd_first, bd_second, ad_first, ad_second])
df = pd.DataFrame(data = {'a': ["123"]})
df.apply(lambda row: benford_function(row['a']), axis=1)
df[['bd_first'],['bd_second'],['ad_first'],['ad_second']]= df.apply(lambda row: benford_function(row['a']), axis=1)
Change:
df[['bd_first'],['bd_second'],['ad_first'],['ad_second']] = ...
to
df[['bd_first', 'bd_second', 'ad_first', 'ad_second']] = ...
This will fix your TypeError, since index elements must be hashable. The way you tried to index into the DataFrame, by passing a tuple of single-element lists, makes pandas interpret each of those single-element lists as an index key, and lists are unhashable.
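For illustration, a minimal runnable sketch of the corrected assignment pattern, with first_two as a hypothetical stand-in for benford_function:
import pandas as pd

def first_two(value):
    # toy stand-in: first two characters, zero-padded when value is short
    return pd.Series([value[0], value[1] if len(value) > 1 else "0"])

df = pd.DataFrame({'a': ["123", "4"]})
df[['bd_first', 'bd_second']] = df.apply(lambda row: first_two(row['a']), axis=1)
print(df)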