Pandas: How to work with sliced data using .loc? - python

df1 = pd.DataFrame({'id_imp': ['a', 'b',
'c','d','e','f','g'],
'name': ['jon', 'jon', 'tom', 'ber', 'gary','gary',
'zul'],
'state' : ['ca', 'ny', 'tn','ca','tn','tn','il'],
'county': ['wood','wood','fair','bridge','rosewelt','rosewelt','lili']})
df2 = pd.DataFrame({'id_sal': ['h', 'i', 'j','k','l'],
'name': ['jon', 'zolie', 'tom', 'ber', 'gary'],
'state' : ['ca', 'ch', 'tn','ca','tn'],
'county': ['wood','plas','fair','bridge','rosewelt']})
df3 = df1.loc[(~df1.name.isin(df2.name))]
I am trying to do small operation by writing below code but its giving me a warning: What could be the problem?
df3['name'] = df3.loc[:, 'name'].fillna(0)
SettingWithCopyWarning: Try using .loc[row_indexer,col_indexer] = value instead

It looks like:
df['name'] - returns an entirely new object, i.e. a copy
But you want to work with the original object. so use:
df3.loc[:, 'name'] - which returns a subset of the original object, i.e. a view
df3.loc[:, 'name'] = df3['name'].fillna(0)
If you are trying to select rows and columns in the same line of code .loc[] works better.

Related

Place math operation or values side by side in database Pandas

I'm just giving one dataset example of what I need to do with a real dataset at my company with python/pandas.
import pandas as pd
import numpy as np
rng = np.random.RandomState(0)
df = pd.DataFrame({'product_code': ['A', 'B', 'C', 'A', 'B', 'C'],
'price': range(6),
'region': rng.randint(0, 10, 6)},
columns = ['product_code', 'price', 'region'])
df
It will give us:
How do I place products showing side by side the current price, the minimun price and the max price like this:
I've just tried a groupby and aggregate function but I cound't get what I want.
df.groupby('product_code').aggregate({
'price' :'price',
'price':'min',
'price': 'max'
})
min_ = df.groupby('product_code')['price'].min()
max_ = df.groupby('product_code')['price'].max()
df['min'] = df['product_code'].apply(lambda x: min_[x])
df['max'] = df['product_code'].apply(lambda x: max_[x])

Pandas Dataframe from list nested in json

I have a request that gets me some data that looks like this:
[{'__rowType': 'META',
'__type': 'units',
'data': [{'name': 'units.unit', 'type': 'STRING'},
{'name': 'units.classification', 'type': 'STRING'}]},
{'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']}]
and would like to construct a (Pandas) DataFrame that looks like this:
Things like pd.DataFrame(pd.json_normalize(test)['data'] are close but still throw the whole list into the column instead of making separate columns. record_path sounded right but I can't get it to work correctly either.
Any help?
It's difficult to know how the example generalizes, but for this particular case you could use:
pd.DataFrame([d['data'] for d in test
if d.get('__rowType', None)=='DATA' and 'data' in d],
columns=['unit', 'classification']
)
NB. assuming test the input list
output:
unit classification
0 A Energie
1 bar
2 CCM Volumen
3 CDM Volumen
Instead of just giving you the code, first I explain how you can do this by details and then I'll show you the exact steps to follow and the final code. This way you understand everything for any further situation.
When you want to create a pandas dataframe with two columns you can do this by creating a dictionary and passing it to DataFrame class:
my_data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=my_data)
This will result in this dataframe:
So if you want to have the dataframe you specified in your question the my_data dictionary should be like this:
my_data = {
'unit': ['A', 'bar', 'CCM', 'CDM'],
'classification': ['Energie', '', 'Volumen', 'Volumen'],
}
df = pd.DataFrame(data=my_data, )
df.index = np.arange(1, len(df)+1)
df
(You can see the df.index=... part. This is because that the index column of the desired dataframe is started at 1 in your question)
So if you want to do so you just have to extract these data from the data you provided and convert them to the exact dictionary mentioned above (my_data dictionary)
To do so you can do this:
# This will get the data values like 'bar', 'CCM' and etc from your initial data
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the columns names from meta data
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
So the whole code would be this:
d = YOUR_DATA
# This will get the data values like 'bar', 'CCM' and etc
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the columns names from meta data
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
df = pd.DataFrame(data=my_data, )
df.index = np.arange(1, len(df)+1)
df #or print(df)
Note: Of course you can do all of this in one complex line of code but to avoid confusion I decided to do this in couple of lines of code

list of DataFrames as an argument of function/loop

I have multiple DataFrame and I need to perform various operations on them. I want to put them in one list to avoid listing them all the time as in the example bellow:
for df in (df1, df2,df3,df4,df5,df6,df7):
df.columns=['COUNTRY','2018','2019']
df.replace({':':''}, regex=True, inplace=True)
df.replace({' ':''}, regex=True, inplace=True)
df["2018"] = pd.to_numeric(df["2018"], downcast="float")
df["2019"] = pd.to_numeric(df["2019"], downcast="float")
I tried to make a list of them (DataFrames=[df1,df2,df3,df4,df5,df6,df7]) but it's working neither in the loop or as an argument of a function.
for df in (DataFrame):
df.columns=['COUNTRY','2018','2019']
df.replace({':':''}, regex=True, inplace=True)
df.replace({' ':''}, regex=True, inplace=True)
df["2018"] = pd.to_numeric(df["2018"], downcast="float")
df["2019"] = pd.to_numeric(df["2019"], downcast="float")
you can place the dataframes on a list and add the columns like this:
import pandas as pd
from pandas import DataFrame
data = {'COUNTRY': ['country1', 'country2', 'country3'],
'2018': [12.0, 27, 35],
'2019': [23, 39.6, 40.3],
'2020': [35, 42, 56]}
df_list = [DataFrame(data), DataFrame(data), DataFrame(data),
DataFrame(data), DataFrame(data), DataFrame(data),
DataFrame(data)]
def change_dataframes(data_frames=None):
for df in data_frames:
df = df.loc[:, ['COUNTRY', '2018', '2019']]
df.replace({':': ''}, regex=True, inplace=True)
df.replace({' ': ''}, regex=True, inplace=True)
pd.to_numeric(df['2018'], downcast="float")
pd.to_numeric(df['2019'], downcast="float")
return data_frames
Using nunvie's answer as a base, here is another option for you:
import pandas as pd
data = {
'COUNTRY': ['country1', 'country2', 'country3'],
'2018': ['12.0', '27', '35'],
'2019': ['2:3', '3:9.6', '4:0.3'],
'2020': ['35', '42', '56']
}
df_list = [pd.DataFrame(data) for i in range(5)]
def data_prep(df: pd.DataFrame):
df = df.loc[:, ['COUNTRY', '2018', '2019']]
df.replace({':': ''}, regex=True, inplace=True)
df.replace({' ': ''}, regex=True, inplace=True)
df['2018'] = pd.to_numeric(df['2018'], downcast="float")
df['2019'] = pd.to_numeric(df['2019'], downcast="float")
return df
new_df_list = map(data_prep, df_list)
The improvements (in my opinion) are as follows. First, it is more concise to use list comprehension for the test setup (that's not directly related to the answer). Second, pd.to_numeric doesn't have inplace (at least in pandas 1.2.3). It returns the series you passed if the parsing succeeded. Thus, you need to explicitly say df['my_col'] = pd.to_numeric(df['my_col']).
And third, I've used map to apply the data_prep function to each DataFrame in the list. This makes data_prep responsible for only one data frame and also saves you from writing loops. The benefit is leaner and more readable code, if you like the functional flavour of it, of course.

Comparing date columns between two dataframes

The project I'm working on requires me to find out which 'project' has been updated since the last time it was processed. For this purpose I have two dataframes which both contain three columns, the last one of which is a date signifying the last time a project is updated. The first dataframe is derived from a query on a database table which records the date a 'project' is updated. The second is metadata I store myself in a different table about the last time my part of the application processed a project.
I think I came pretty far but I'm stuck on the following error, see the code provided below:
lastmatch = pd.DataFrame({
'projectid': ['1', '2', '2', '3'],
'stage': ['c', 'c', 'v', 'v'],
'lastmatchdate': ['2020-08-31', '2013-11-24', '2013-11-24',
'2020-08-31']
})
lastmatch['lastmatchdate'] = pd.to_datetime(lastmatch['lastmatchdate'])
processed = pd.DataFrame({
'projectid': ['1', '2'],
'stage': ['c', 'v'],
'process_date': ['2020-08-30', '2013-11-24']
})
processed['process_date'] = pd.to_datetime(
processed['process_date']
)
unprocessed = lastmatch[~lastmatch.isin(processed)].dropna()
processed.set_index(['projectid', 'stage'], inplace=True)
lastmatch.set_index(['projectid', 'stage'], inplace=True)
processed.sort_index(inplace=True)
lastmatch.sort_index(inplace=True)
print(lastmatch['lastmatchdate'])
print(processed['process_date'])
to_process = lastmatch.loc[lastmatch['lastmatchdate'] > processed['process_date']]
The result I want to achieve is a dataframe containing the rows where the 'lastmatchdate' is greater than the date that the project was last processed (process_date). However this line:
to_process = lastmatch.loc[lastmatch['lastmatchdate'] > processed['process_date']]
produces a ValueError: Can only compare identically-labeled Series objects. I think it might be a syntax I don't know of or got wrong.
The output I expect is in this case:
lastmatchdate
projectid stage
1 c 2020-08-31
So concretely the question is: how do I get a dataframe containing only the rows of another dataframe having the (datetime) value of column a greater than column b of the other dataframe.
merged = pd.merge(processed, lastmatch, left_index = True, right_index = True)
merged = merged.assign(to_process = merged['lastmatchdate']> merged['process_date'])
You will get the following:
process_date lastmatchdate to_process
projectid stage
1 c 2020-08-31 2020-08-31 False
2 v 2013-11-24 2013-11-24 False
you 've receiver ValueError because you tried to compare two different dataframes, if you want to compare row by row two dataframes, merge them before
lastmatch = pd.DataFrame({
'projectid': ['1', '2', '2', '3'],
'stage': ['c', 'c', 'v', 'v'],
'lastmatchdate': ['2020-08-31', '2013-11-24', '2013-11-24',
'2020-08-31']
})
lastmatch['lastmatchdate'] = pd.to_datetime(lastmatch['lastmatchdate'])
processed = pd.DataFrame({
'projectid': ['1', '2'],
'stage': ['c', 'v'],
'process_date': ['2020-08-30', '2013-11-24']
})
processed['process_date'] = pd.to_datetime(
processed['process_date']
)
df=pd.merge(lastmatch,processed,on=['stage','projectid'])
df=df[
df.lastmatchdate>df.process_date
]
print(df)
projectid stage lastmatchdate process_date
0 1 c 2020-08-31 2020-08-30

Dask categorize() won't work after using .loc

I'm having a serious issue using dask (dask version: 1.00, pandas version: 0.23.3). I am trying to load a dask dataframe from a CSV file, filter the results into two separate dataframes, and perform operations on both.
However, after the split the dataframes and try to set the category columns as 'known', they remain 'unknown'. Thus I cannot continue with my operations (which require category columns to be 'known'.)
NOTE: I have created a minimum example as suggested using pandas instead of read_csv().
import pandas as pd
import dask.dataframe as dd
# Specify dtypes
b_dtypes = {
'symbol': 'category',
'price': 'float64',
}
i_dtypes = {
'symbol': 'category',
'price': 'object'
}
# Specify a function to quickly set dtypes
def to_dtypes(df, dtypes):
for column, dtype in dtypes.items():
if column in df.columns:
df[column] = df.loc[:, column].astype(dtype)
return df
# Set up our test data
data = [
['B', 'IBN', '9.9800'],
['B', 'PAY', '21.5000'],
['I', 'PAY', 'seventeen'],
['I', 'SPY', 'ten']
]
# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')
# Convert into dask
df = dd.from_pandas(pdf, npartitions=3)
#
## At this point 'df' simulates what I get when I read the mixed-type CSV file via dask
#
# Split the dataframe by the 'type' column
b_df = df.loc[df['type'] == 'B', :]
i_df = df.loc[df['type'] == 'I', :]
# Convert columns into our intended dtypes
b_df = to_dtypes(b_df, b_dtypes)
i_df = to_dtypes(i_df, i_dtypes)
# Let's convert our 'symbol' column to known categories
b_df = b_df.categorize(columns=['symbol'])
i_df['symbol'] = i_df['symbol'].cat.as_known()
# Is our symbol column known now?
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)
#
## print() returns 'False' for both, this makes me want to kill myself.
## (Please help...)
#
UPDATE: So it seems that if I shift the 'npartitions' parameters to 1, then print() returns True in both cases. So this appears to be an issue with the partitions containing different categories. However loading both dataframes into only two partitions is not feasible, so is there a way I can tell dask to do some sort of re-sorting to make the categories consistent across partitions?
The answer for your problem is basically contained in doc. I'm referring to the part code commented by # categorize requires computation, and results in known categoricals I'll expand here because it seems to me you're misusing loc
import pandas as pd
import dask.dataframe as dd
# Set up our test data
data = [['B', 'IBN', '9.9800'],
['B', 'PAY', '21.5000'],
['I', 'PAY', 'seventeen'],
['I', 'SPY', 'ten']
]
# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')
# Convert into dask
ddf = dd.from_pandas(pdf, npartitions=3)
# Split the dataframe by the 'type' column
# reset_index is not necessary
b_df = ddf[ddf["type"] == "B"].reset_index(drop=True)
i_df = ddf[ddf["type"] == "I"].reset_index(drop=True)
# Convert columns into our intended dtypes
b_df = b_df.categorize(columns=['symbol'])
b_df["price"] = b_df["price"].astype('float64')
i_df = i_df.categorize(columns=['symbol'])
# Is our symbol column known now? YES
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)

Categories

Resources