Place math operations or values side by side in a pandas DataFrame - Python

I'm just giving one example dataset to illustrate what I need to do with a real dataset at my company using Python/pandas.
import pandas as pd
import numpy as np

rng = np.random.RandomState(0)
df = pd.DataFrame({'product_code': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'price': range(6),
                   'region': rng.randint(0, 10, 6)},
                  columns=['product_code', 'price', 'region'])
df
It will give us:
How do I show each product side by side with its current price, its minimum price and its maximum price, like this:
I've just tried a groupby and aggregate function, but I couldn't get what I want.
df.groupby('product_code').aggregate({
    'price': 'price',
    'price': 'min',
    'price': 'max'
})

# Per-product min/max via groupby, then mapped back onto each row by product_code
min_ = df.groupby('product_code')['price'].min()
max_ = df.groupby('product_code')['price'].max()
df['min'] = df['product_code'].apply(lambda x: min_[x])
df['max'] = df['product_code'].apply(lambda x: max_[x])
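As an alternative sketch (not part of the original answer): groupby().transform() broadcasts each group's result back onto every row, which puts the min and max directly next to the current price without the apply lookups.

# Sketch: same result via transform, which aligns group aggregates with the original rows
grouped = df.groupby('product_code')['price']
df['min'] = grouped.transform('min')
df['max'] = grouped.transform('max')
df  # each row now shows its product's current, min and max price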

Related

Group by time intervals and additional attribute

I have this data:
import pandas as pd

data = {
    'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33', '2022-11-03 00:00:35',
                  '2022-11-03 00:00:46', '2022-11-03 00:01:21', '2022-11-03 00:01:30'],
    'from': ['A', 'A', 'A', 'A', 'B', 'C'],
    'to': ['B', 'B', 'B', 'C', 'C', 'B'],
    'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']
}
df = pd.DataFrame(data)
I want to create two sets of CSVs:
One CSV for each type of vehicle (8 in total) where the rows will be grouped/aggregated by timestamp (into 15-minute intervals throughout the day) and by the "FROM" column - there will be no "TO" column here.
One CSV for each type of vehicle (8 in total) where the rows will be grouped/aggregated by timestamp (into 15-minute intervals throughout the day), by the "FROM" column and by the "TO" column.
The difference between the two sets is that one will count all FROM items and the other will group and count them by pairs of FROM and TO.
The output will be an aggregated count of vehicles of a given type per 15-minute interval, summed up by the FROM column and also by the combination of the FROM and TO columns.
1st output can look like this for each vehicle type:
2nd output:
I tried using pandas groupby() and resample() but, due to my limited knowledge, to no success. I can do this in Excel, but very inefficiently. I want to learn more Python and be more efficient, therefore I would like to code it in pandas.
I tried df.groupby(['FROM', 'TO']).count() but I lack the knowledge to use it for what I need. I keep either getting errors when I do something I should not, or the output is not what I need.
I tried df.groupby(pd.Grouper(freq='15Min')).count() but it seems I perhaps have an incorrect data type.
And I don't know if this is applicable.
If I understand you correctly, one approach could be as follows:
Data
import pandas as pd
# IIUC, you want e.g. '2022-11-03 00:00:06' to land in the `00:15` bucket, so we need `to_offset`
from pandas.tseries.frequencies import to_offset

# adjusting the last 2 timestamps to get a different interval group
data = {'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33',
                      '2022-11-03 00:00:35', '2022-11-03 00:00:46',
                      '2022-11-03 00:20:21', '2022-11-03 00:21:30'],
        'from': ['A', 'A', 'A', 'A', 'B', 'C'],
        'to': ['B', 'B', 'B', 'C', 'C', 'B'],
        'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']}
df = pd.DataFrame(data)

print(df)

             timestamp from to type
0  2022-11-03 00:00:06    A  B  Car
1  2022-11-03 00:00:33    A  B  Car
2  2022-11-03 00:00:35    A  B  Van
3  2022-11-03 00:00:46    A  C  Car
4  2022-11-03 00:20:21    B  C  HGV
5  2022-11-03 00:21:30    C  B  Van

# e.g. for FROM we want:    `A`,   4 (COUNT), `00:15` (TIME-END)
# e.g. for FROM-TO we want: `A-B`, 3 (COUNT), `00:15` (TIME-END)
#                           `A-C`, 1 (COUNT), `00:15` (TIME-END)
Code
# convert the time strings to datetime and set the column as the index
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)

# add a `15T` (== 15 minutes) offset to the datetime values
df.index = df.index + to_offset('15T')

# create a `dict` for conversion of the column names
cols = {'timestamp': 'TIME-END', 'from': 'FROM', 'to': 'TO'}

# we're doing basically the same for both outputs, so let's use a for loop on a nested list
nested_list = [['from'], ['from', 'to']]

for item in nested_list:
    # groupby `item` (i.e. `['from']` and `['from', 'to']`)
    # use `.agg` to create a named output (`COUNT`), applied to `item[0]`, so 2x on: `from`
    # and get the `count`. Finally, reset the index
    out = df.groupby(item).resample('15T').agg(COUNT=(item[0], 'count')).reset_index()

    # rename the columns using our `cols` dict
    out = out.rename(columns=cols)

    # convert timestamps like '2022-11-03 00:15:00' to '00:15:00'
    out['TIME-END'] = out['TIME-END'].dt.strftime('%H:%M:%S')

    # rearrange the order of the columns; for the second `item` we also need `to` (now: `TO`)
    if 'TO' in out.columns:
        out = out.loc[:, ['FROM', 'TO', 'COUNT', 'TIME-END']]
    else:
        out = out.loc[:, ['FROM', 'COUNT', 'TIME-END']]

    # write the output to a csv file; e.g. use an `f-string` to customize the file name,
    # and pass `index=False` to avoid writing away the index
    out.to_csv(f'output_{"_".join(item)}.csv', index=False)  # i.e. 'output_from.csv', 'output_from_to.csv'
Output (loaded in Excel)
Relevant documentation:
pd.to_datetime, df.set_index, .to_offset
df.groupby, .resample
df.rename
.dt.strftime
df.to_csv
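As a hedged side note (not from the original answer): since the question mentions pd.Grouper, roughly the same counts can be produced with it directly, assuming df still has its 'timestamp' column as datetimes (i.e. before the set_index step above). The file names below are illustrative, and note that Grouper labels each bucket by its start, whereas the answer above offsets the timestamps so buckets are labelled by their end.

# Sketch: grouping by 15-minute buckets plus the key columns via pd.Grouper
for keys in (['from'], ['from', 'to']):
    out = (df.groupby([pd.Grouper(key='timestamp', freq='15T'), *keys])
             .size()                          # row count per bucket and key combination
             .reset_index(name='COUNT'))
    out.to_csv(f"grouper_output_{'_'.join(keys)}.csv", index=False)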

Compare df's including detailed insight in data

I'm working on a Python project with:
df_testR with columns {'Name', 'City', 'Licence', 'Amount'}
df_testF with columns {'Name', 'City', 'Licence', 'Amount'}
I want to compare both df's. The result should be a df where I see the Name, City, Licence and the Amount. Normally, df_testR and df_testF should be exactly the same.
In case they are not the same, I want to see the difference as Amount_R vs Amount_F.
I referred to: Diff between two dataframes in pandas
But I receive a table with TRUE and FALSE only:

Name   City   Licence   Amount
True   True   True      False
But I'd like to get a table that lists ONLY the lines where differences occur, and that shows the differing data, such as:

Name   City   Licence   Amount_R   Amount_F
Paul   NY     YES       200        500

Here, both tables contain Paul, NY and Licence = YES, but table R contains 200 as the Amount while table F contains 500. I want my analysis to return a table that captures only the lines where such differences occur.
Could someone help?
import copy
import pandas as pd
data1 = {'Name': ['A', 'B', 'C'], 'City': ['SF', 'LA', 'NY'], 'Licence': ['YES', 'NO', 'NO'], 'Amount': [100, 200, 300]}
data2 = copy.deepcopy(data1)
data2.update({'Amount': [500, 200, 300]})
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df2.drop(1, inplace=True)
First find the missing rows and print them:
matching = df1.isin(df2)
meta_data_columns = ['Name', 'City', 'Licence']
metadata_match = matching[meta_data_columns]
metadata_match['check'] = metadata_match.apply(all, 1, raw=True)
missing_rows = list(metadata_match.index[~metadata_match['check']])
if missing_rows:
    print('Some rows are missing from df2:')
    print(df1.iloc[missing_rows, :])
Then drop these rows and merge:
df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns)
Now remove the rows that have the same amount:
df_different_amounts = df3.loc[df3['Amount_x'] != df3['Amount_y'], :]
I assumed the DFs are sorted.
If you're dealing with very large DFs it might be better to first filter the DFs to make the merge faster.
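As an additional hedged sketch (using the question's df_testR/df_testF naming rather than the example df1/df2 above): merging on the key columns with suffixes gives the Amount_R/Amount_F layout directly, after which you keep only the mismatching rows.

# Sketch: side-by-side amounts via a suffixed merge, then filter to the differences
keys = ['Name', 'City', 'Licence']
merged = df_testR.merge(df_testF, on=keys, suffixes=('_R', '_F'))
diff = merged.loc[merged['Amount_R'] != merged['Amount_F']]
print(diff)  # only rows where the two tables disagree on Amount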

Missing value Imputation based on regression in pandas

I want to impute the missing data using multivariate imputation. In the dataset attached below, column A has some missing values, and columns A and B have a correlation factor of 0.70. So I want to use a regression-type relationship between column A and column B to impute the missing values in Python.
N.B.: I can do it using the mean, median, and mode, but I want to use the relationship with another column to fill the missing values.
How should I deal with this problem? Your solutions, please.
import pandas as pd
import numpy as np
# from sklearn.preprocessing import Imputer  # unused here; Imputer has been removed from scikit-learn

# assign data of lists
data = {'Date': ['9/19/14', '9/20/14', '9/21/14', '9/21/14', '9/19/14', '9/20/14', '9/21/14',
                 '9/21/14', '9/19/14', '9/20/14', '9/21/14', '9/21/14', '9/21/14'],
        'A': [77.13, 39.58, 33.70, np.nan, np.nan, 39.66, 64.625, 80.04, np.nan, np.nan, 19.43, 54.375, 38.41],
        'B': [19.5, 21.61, 22.25, 25.05, 24.20, 23.55, 5.70, 2.675, 2.05, 4.06, -0.80, 0.45, -0.90],
        'C': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'c', 'c']}

# Create DataFrame
df = pd.DataFrame(data)
df["Date"] = pd.to_datetime(df["Date"])

# Print the output.
print(df)
Use:
from sklearn.neural_network import MLPRegressor

# split into rows where A is known (to fit on) and rows where A is missing (to impute)
dfreg = df[df['A'].notna()]
dfimp = df[df['A'].isna()]

# fit a small neural-network regressor with B as the single feature and A as the target
regr = MLPRegressor(random_state=1, max_iter=200).fit(dfreg['B'].values.reshape(-1, 1), dfreg['A'])
regr.score(dfreg['B'].values.reshape(-1, 1), dfreg['A'])
regr.predict(dfimp['B'].values.reshape(-1, 1))
Note that in the provided data the correlation between the A and B columns is very low (less than 0.05).
To fill the empty cells with the imputed values:
s = df[df['A'].isna()]['A'].index
df.loc[s, 'A'] = regr.predict(dfimp['B'].values.reshape(-1, 1))  # predict, not score, so each row gets its own value
Output:
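For completeness, a hedged sketch of scikit-learn's built-in multivariate approach (not part of the answer above): IterativeImputer models each column with missing values as a function of the other columns, which matches the "regression between A and B" idea; note the API is still marked experimental.

# Sketch: multivariate imputation of column A from column B with IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the experimental API)
from sklearn.impute import IterativeImputer

imp = IterativeImputer(random_state=0)
df[['A', 'B']] = imp.fit_transform(df[['A', 'B']])  # fills the NaNs in A using B
print(df)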

list of DataFrames as an argument of function/loop

I have multiple DataFrames and I need to perform various operations on them. I want to put them in one list to avoid listing them all the time, as in the example below:
for df in (df1, df2, df3, df4, df5, df6, df7):
    df.columns = ['COUNTRY', '2018', '2019']
    df.replace({':': ''}, regex=True, inplace=True)
    df.replace({' ': ''}, regex=True, inplace=True)
    df["2018"] = pd.to_numeric(df["2018"], downcast="float")
    df["2019"] = pd.to_numeric(df["2019"], downcast="float")
I tried to make a list of them (DataFrames = [df1, df2, df3, df4, df5, df6, df7]), but it works neither in the loop nor as an argument of a function.
for df in (DataFrame):
    df.columns = ['COUNTRY', '2018', '2019']
    df.replace({':': ''}, regex=True, inplace=True)
    df.replace({' ': ''}, regex=True, inplace=True)
    df["2018"] = pd.to_numeric(df["2018"], downcast="float")
    df["2019"] = pd.to_numeric(df["2019"], downcast="float")
You can place the DataFrames in a list and add the columns like this:
import pandas as pd
from pandas import DataFrame

data = {'COUNTRY': ['country1', 'country2', 'country3'],
        '2018': [12.0, 27, 35],
        '2019': [23, 39.6, 40.3],
        '2020': [35, 42, 56]}

df_list = [DataFrame(data), DataFrame(data), DataFrame(data),
           DataFrame(data), DataFrame(data), DataFrame(data),
           DataFrame(data)]

def change_dataframes(data_frames=None):
    for df in data_frames:
        df = df.loc[:, ['COUNTRY', '2018', '2019']]
        df.replace({':': ''}, regex=True, inplace=True)
        df.replace({' ': ''}, regex=True, inplace=True)
        pd.to_numeric(df['2018'], downcast="float")
        pd.to_numeric(df['2019'], downcast="float")
    return data_frames
Using nunvie's answer as a base, here is another option for you:
import pandas as pd

data = {
    'COUNTRY': ['country1', 'country2', 'country3'],
    '2018': ['12.0', '27', '35'],
    '2019': ['2:3', '3:9.6', '4:0.3'],
    '2020': ['35', '42', '56']
}
df_list = [pd.DataFrame(data) for i in range(5)]

def data_prep(df: pd.DataFrame):
    df = df.loc[:, ['COUNTRY', '2018', '2019']]
    df.replace({':': ''}, regex=True, inplace=True)
    df.replace({' ': ''}, regex=True, inplace=True)
    df['2018'] = pd.to_numeric(df['2018'], downcast="float")
    df['2019'] = pd.to_numeric(df['2019'], downcast="float")
    return df

new_df_list = map(data_prep, df_list)
The improvements (in my opinion) are as follows. First, it is more concise to use list comprehension for the test setup (that's not directly related to the answer). Second, pd.to_numeric doesn't have inplace (at least in pandas 1.2.3). It returns the series you passed if the parsing succeeded. Thus, you need to explicitly say df['my_col'] = pd.to_numeric(df['my_col']).
And third, I've used map to apply the data_prep function to each DataFrame in the list. This makes data_prep responsible for only one data frame and also saves you from writing loops. The benefit is leaner and more readable code, if you like the functional flavour of it, of course.
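One small usage note on top of that (a sketch, not from the answer itself): map() in Python 3 is lazy, so materialize it into a list before reusing the results.

# Sketch: materialize the mapped results and check the downcast dtypes
new_df_list = list(map(data_prep, df_list))
for cleaned in new_df_list:
    print(cleaned.dtypes)  # '2018' and '2019' should now be float32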

Dask categorize() won't work after using .loc

I'm having a serious issue using dask (dask version: 1.00, pandas version: 0.23.3). I am trying to load a dask DataFrame from a CSV file, filter the results into two separate DataFrames, and perform operations on both.
However, after I split the DataFrames and try to set the category columns as 'known', they remain 'unknown'. Thus I cannot continue with my operations (which require category columns to be 'known').
NOTE: I have created a minimal example, as suggested, using pandas instead of read_csv().
import pandas as pd
import dask.dataframe as dd

# Specify dtypes
b_dtypes = {
    'symbol': 'category',
    'price': 'float64',
}
i_dtypes = {
    'symbol': 'category',
    'price': 'object'
}

# Specify a function to quickly set dtypes
def to_dtypes(df, dtypes):
    for column, dtype in dtypes.items():
        if column in df.columns:
            df[column] = df.loc[:, column].astype(dtype)
    return df

# Set up our test data
data = [
    ['B', 'IBN', '9.9800'],
    ['B', 'PAY', '21.5000'],
    ['I', 'PAY', 'seventeen'],
    ['I', 'SPY', 'ten']
]

# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')

# Convert into dask
df = dd.from_pandas(pdf, npartitions=3)

#
## At this point 'df' simulates what I get when I read the mixed-type CSV file via dask
#

# Split the dataframe by the 'type' column
b_df = df.loc[df['type'] == 'B', :]
i_df = df.loc[df['type'] == 'I', :]

# Convert columns into our intended dtypes
b_df = to_dtypes(b_df, b_dtypes)
i_df = to_dtypes(i_df, i_dtypes)

# Let's convert our 'symbol' column to known categories
b_df = b_df.categorize(columns=['symbol'])
i_df['symbol'] = i_df['symbol'].cat.as_known()

# Is our symbol column known now?
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)

#
## print() returns 'False' for both, this makes me want to kill myself.
## (Please help...)
#
UPDATE: It seems that if I set the 'npartitions' parameter to 1, then print() returns True in both cases. So this appears to be an issue with the partitions containing different categories. However, loading both dataframes into only two partitions is not feasible, so is there a way I can tell dask to do some sort of re-sorting to make the categories consistent across partitions?
The answer to your problem is basically contained in the docs. I'm referring to the part of the code commented with "# categorize requires computation, and results in known categoricals". I'll expand here, because it seems to me you're misusing loc.
import pandas as pd
import dask.dataframe as dd

# Set up our test data
data = [['B', 'IBN', '9.9800'],
        ['B', 'PAY', '21.5000'],
        ['I', 'PAY', 'seventeen'],
        ['I', 'SPY', 'ten']]

# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')

# Convert into dask
ddf = dd.from_pandas(pdf, npartitions=3)

# Split the dataframe by the 'type' column
# reset_index is not necessary
b_df = ddf[ddf["type"] == "B"].reset_index(drop=True)
i_df = ddf[ddf["type"] == "I"].reset_index(drop=True)

# Convert columns into our intended dtypes
b_df = b_df.categorize(columns=['symbol'])
b_df["price"] = b_df["price"].astype('float64')
i_df = i_df.categorize(columns=['symbol'])

# Is our symbol column known now? YES
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)
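As a quick follow-up sketch (assuming the setup above, not part of the original answer): once categorize() has run, the categories are known and shared across partitions, so the usual categorical accessors should work and the frames can be materialized for inspection.

# Sketch: inspect the known categories and materialize the result
print(b_df['symbol'].cat.categories.tolist())  # the distinct symbols in the 'B' frame
print(b_df.dtypes)                             # 'symbol' shows as category, 'price' as float64
print(b_df.compute())                          # back to a concrete pandas DataFrame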
