Join recarrays by attributes in Python

I am trying to join recarrays in Python so that one value in the lookup array joins to many elements. The following code works for a 1:1 relationship, but for many:1 it only joins one instance:
import numpy as np
import matplotlib.mlab  # import the mlab submodule explicitly so matplotlib.mlab.rec_join resolves
# First data structure
sex = np.array(['M', 'F', 'M', 'F', 'M', 'F'])
causes = np.array(['c1', 'c1', 'c2', 'c2', 'c3', 'c3'])
data1 = np.core.records.fromarrays([sex, causes], names='sex, causes')
# Second data structure
causes2 = np.array(['c1', 'c2', 'c3'])
analyst = np.array(['p1', 'p2', 'p3'])
data2 = np.core.records.fromarrays([causes2, analyst], names='causes, analyst')
# Join on Cause
all_data = matplotlib.mlab.rec_join(('causes'), data1, data2, jointype='leftouter')
What I would like the all_data recarray to contain is all of the data from data1 with the corresponding analyst indicated in data2.

There may be a good way to do this with record arrays alone, but a plain Python dict works just as well here. I would also like to know the NumPy way of doing this, if there is a good one.
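from matplotlib import mlab
# Map each cause to its analyst, then look up the analyst for every row of data1
dct = dict(zip(data2['causes'], data2['analyst']))
all_data = mlab.rec_append_fields(data1, 'analyst',
                                  [dct[x] for x in data1['causes']])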
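As for a NumPy-only way, here is a minimal sketch of the same many:1 lookup using `np.searchsorted` and `numpy.lib.recfunctions.append_fields`. It assumes every cause in `data1` also appears in `data2` (a missing key would silently pick up the wrong analyst or raise an IndexError), and it may be useful because, as far as I know, the `mlab.rec_*` helpers are no longer available in recent Matplotlib releases. The names `order`, `pos` and `matched` are only illustrative.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Same data as in the question
sex = np.array(['M', 'F', 'M', 'F', 'M', 'F'])
causes = np.array(['c1', 'c1', 'c2', 'c2', 'c3', 'c3'])
data1 = np.rec.fromarrays([sex, causes], names='sex, causes')

causes2 = np.array(['c1', 'c2', 'c3'])
analyst = np.array(['p1', 'p2', 'p3'])
data2 = np.rec.fromarrays([causes2, analyst], names='causes, analyst')

# Sort the lookup table by key, then locate each data1 key in the sorted keys.
# Assumption: every key in data1 exists in data2.
order = np.argsort(data2['causes'])
pos = np.searchsorted(data2['causes'][order], data1['causes'])
matched = data2['analyst'][order][pos]

# Append the looked-up analysts as a new field on data1
all_data = rfn.append_fields(data1, 'analyst', matched,
                             usemask=False, asrecarray=True)
print(all_data)
```

Every row of `data1` is kept, and rows that share a cause get the same analyst.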

Related

Compare df's including detailed insight in data

I have a Python project with two DataFrames:
df_testR with columns={'Name', 'City','Licence', 'Amount'}
df_testF with columns={'Name', 'City','Licence', 'Amount'}
I want to compare both DataFrames. The result should be a DataFrame where I see the Name, City, Licence and Amount. Normally, df_testR and df_testF should be exactly the same.
If they are not the same, I want to see the difference between Amount_R and Amount_F.
I referred to: Diff between two dataframes in pandas
But I receive a table with True and False only:
Name  City  Licence  Amount
True  True  True     False
But I'd like a table that lists ONLY the rows where differences occur and shows both values side by side, like this:
Name  City  Licence  Amount_R  Amount_F
Paul  NY    YES      200       500
Here, both tables contain PAUL, NY and Licence = Yes, but Table R contains 200 as Amount and table F contains 500 as amount. I want to receive a table from my analysis that captures only the lines where such differences occur.
Could someone help?
import copy
import pandas as pd
data1 = {'Name': ['A', 'B', 'C'], 'City': ['SF', 'LA', 'NY'], 'Licence': ['YES', 'NO', 'NO'], 'Amount': [100, 200, 300]}
data2 = copy.deepcopy(data1)
data2.update({'Amount': [500, 200, 300]})
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df2.drop(1, inplace=True)
First find the missing rows and print them:
matching = df1.isin(df2)
meta_data_columns = ['Name', 'City', 'Licence']
# A row counts as present in df2 only if all of its metadata columns match
metadata_match = matching[meta_data_columns].all(axis=1)
missing_rows = list(metadata_match.index[~metadata_match])
if missing_rows:
    print('Some rows are missing from df2:')
    # missing_rows holds index labels, so select with .loc rather than .iloc
    print(df1.loc[missing_rows, :])
Then drop these rows and merge:
df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns)
Now remove the rows that have the same amount:
df_different_amounts = df3.loc[df3['Amount_x'] != df3['Amount_y'], :]
I assumed the DataFrames are sorted the same way and share the same index, since isin compares rows by index label.
If you're dealing with very large DFs it might be better to first filter the DFs to make the merge faster.
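If row order or index cannot be relied on, a key-based merge may be simpler. The sketch below is only an illustration: the sample rows are made up, and it assumes that Name, City and Licence together identify a row. It merges on those columns with `_R`/`_F` suffixes and keeps only the rows where the two amounts differ.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question
df_testR = pd.DataFrame({'Name': ['Paul', 'Anna'],
                         'City': ['NY', 'LA'],
                         'Licence': ['YES', 'NO'],
                         'Amount': [200, 300]})
df_testF = pd.DataFrame({'Name': ['Paul', 'Anna'],
                         'City': ['NY', 'LA'],
                         'Licence': ['YES', 'NO'],
                         'Amount': [500, 300]})

keys = ['Name', 'City', 'Licence']

# The outer merge keeps rows that exist in only one frame; their missing
# amount becomes NaN, which also compares as "different" below.
merged = df_testR.merge(df_testF, on=keys, how='outer', suffixes=('_R', '_F'))

# Keep only the rows whose amounts differ
differences = merged[merged['Amount_R'] != merged['Amount_F']]
print(differences)
```

Using `how='inner'` instead would restrict the comparison to rows present in both DataFrames.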

Missing value Imputation based on regression in pandas

I want to impute missing data using multivariate imputation. In the data set below, column A has some missing values, and columns A and B have a correlation of 0.70. I want to use a regression-style relationship between column A and column B to impute the missing values in Python.
N.B.: I can do it using the mean, median, or mode, but I want to use the relationship with another column to fill the missing values.
How should I approach this problem?
import pandas as pd
import numpy as np
# assign data of lists.
data = {'Date': ['9/19/14', '9/20/14', '9/21/14', '9/21/14','9/19/14', '9/20/14', '9/21/14', '9/21/14','9/19/14', '9/20/14', '9/21/14', '9/21/14', '9/21/14'],
'A': [77.13, 39.58, 33.70, np.nan, np.nan,39.66, 64.625, 80.04, np.nan ,np.nan ,19.43, 54.375, 38.41],
'B': [19.5, 21.61, 22.25, 25.05, 24.20, 23.55, 5.70, 2.675, 2.05,4.06, -0.80, 0.45, -0.90],
'C':['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'c', 'c']}
# Create DataFrame
df = pd.DataFrame(data)
df["Date"]= pd.to_datetime(df["Date"])
# Print the output.
print(df)
Use:
dfreg = df[df['A'].notna()]
dfimp = df[df['A'].isna()]
from sklearn.neural_network import MLPRegressor
regr = MLPRegressor(random_state=1, max_iter=200).fit(dfreg['B'].values.reshape(-1, 1), dfreg['A'])
regr.score(dfreg['B'].values.reshape(-1, 1), dfreg['A'])
regr.predict(dfimp['B'].values.reshape(-1, 1))
Note that in the provided data the correlation between columns A and B is very low (less than 0.05).
To fill the empty cells with the imputed values:
s = df[df['A'].isna()]['A'].index
# Assign the model's predictions (not its score) to the missing cells
df.loc[s, 'A'] = regr.predict(dfimp['B'].values.reshape(-1, 1))
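If a plain linear relationship is what you are after, a least-squares line is a lighter alternative to a neural network. This is a minimal sketch, not the method used above: it fits A as a linear function of B with `np.polyfit` on the rows where A is known, then fills the gaps from that line. The names `known`, `slope` and `intercept` are illustrative, and the A and B values are copied from the question.

```python
import numpy as np
import pandas as pd

# Columns A and B from the question
df = pd.DataFrame({
    'A': [77.13, 39.58, 33.70, np.nan, np.nan, 39.66, 64.625,
          80.04, np.nan, np.nan, 19.43, 54.375, 38.41],
    'B': [19.5, 21.61, 22.25, 25.05, 24.20, 23.55, 5.70,
          2.675, 2.05, 4.06, -0.80, 0.45, -0.90],
})

known = df['A'].notna()

# Fit A = slope * B + intercept on the rows where A is observed
slope, intercept = np.polyfit(df.loc[known, 'B'], df.loc[known, 'A'], deg=1)

# Fill only the missing cells with the fitted values
df.loc[~known, 'A'] = slope * df.loc[~known, 'B'] + intercept
print(df)
```

With a correlation as weak as the one in this sample, the fitted line will sit close to the mean of A, so the imputed values carry little extra information.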

extract value from object to column python

I have output in the format below:
values = {'A': (node:A connections:{B:[0.9565217391304348], C:[0.5], D:[0], E:[0], F:[0], I:[0]}),
'B': (node:B connections:{F:[0.7], D:[0.631578947368421], J:[0]}),
'C': (node:C connections:{D:[0.5]})}
When I run print(type(values)), the output is pm4py.objects.heuristics_net.obj.HeuristicsNet
I want to take each node and its connections, then create two columns that list every connection of each individual node, as shown below:
import pandas as pd
df = pd.DataFrame({'Nodes':['A','A','A','A','A','A','B','B','B','C'], 'Connection':['B','C','D','E','F','I', 'F', 'D', 'J', 'D']})
It is simply the combination of each node with each of its connections. I have worked with simple dictionaries before, but I do not know how to extract the information I need here.
How should I proceed?

Creating & using categorical data type with pandas

I'm having trouble changing the type of my variable to a categorical data type.
My variable is called "Energy class" and contains the following values:
A++, A+, A, B, C, D, E, F, G.
I want to change the type to a category and order the categories in that same order.
Hence: A++ = 1, A+ = 2, A = 3, B = 4 , etc.
I will also have to perform the same manipulation with another variable, "Condition of the building", which contains the following values: "Very good", "Good", "To be restored".
I tried using the pandas set_categories() method, but it didn't work. There is very little information on how to use it in the documentation.
Does anyone know how to deal with this?
Thank you
You can use map:
energy_class = {'A++':1, 'A+':2,...}
df['Energy class'] = df['Energy class'].map(energy_class)
A bit fancier, when you have an ordered list of the classes:
energy_classes = ['A++', 'A+',...]
# Build the class -> rank mapping from the list order, starting at 1
df['Energy class'] = df['Energy class'].map({c: i for i, c in enumerate(energy_classes, 1)})
You can use an ordered pd.Categorical:
df['Energy class'] = pd.Categorical(
    df['Energy class'],
    categories=['A++', 'A+', 'A', 'B', 'C', 'D', 'E', 'F', 'G'],
    ordered=True)
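Here is a small sketch of what the ordered categorical gives you. The sample values and the `Energy rank` column name are made up for illustration. The category codes start at 0, so adding 1 yields the A++ = 1, A+ = 2, ... numbering from the question, and sorting follows the category order rather than alphabetical order.

```python
import pandas as pd

# Hypothetical sample values for the "Energy class" column
df = pd.DataFrame({'Energy class': ['B', 'A++', 'G', 'A+']})

order = ['A++', 'A+', 'A', 'B', 'C', 'D', 'E', 'F', 'G']
df['Energy class'] = pd.Categorical(df['Energy class'],
                                    categories=order, ordered=True)

# Codes are 0-based positions in `order`; add 1 for A++ = 1, A+ = 2, ...
df['Energy rank'] = df['Energy class'].cat.codes + 1

# Sorting respects the declared category order
sorted_df = df.sort_values('Energy class')
print(sorted_df)
```

The same pattern should work for "Condition of the building" with its own ordered list of categories.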

Dask categorize() won't work after using .loc

I'm having a serious issue using dask (dask version: 1.00, pandas version: 0.23.3). I am trying to load a dask dataframe from a CSV file, filter the results into two separate dataframes, and perform operations on both.
However, after I split the dataframes and try to set the category columns as 'known', they remain 'unknown'. Thus I cannot continue with my operations (which require category columns to be 'known').
NOTE: I have created a minimum example as suggested using pandas instead of read_csv().
import pandas as pd
import dask.dataframe as dd
# Specify dtypes
b_dtypes = {
    'symbol': 'category',
    'price': 'float64',
}
i_dtypes = {
    'symbol': 'category',
    'price': 'object'
}
# Specify a function to quickly set dtypes
def to_dtypes(df, dtypes):
    for column, dtype in dtypes.items():
        if column in df.columns:
            df[column] = df.loc[:, column].astype(dtype)
    return df
# Set up our test data
data = [
    ['B', 'IBN', '9.9800'],
    ['B', 'PAY', '21.5000'],
    ['I', 'PAY', 'seventeen'],
    ['I', 'SPY', 'ten']
]
# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')
# Convert into dask
df = dd.from_pandas(pdf, npartitions=3)
#
## At this point 'df' simulates what I get when I read the mixed-type CSV file via dask
#
# Split the dataframe by the 'type' column
b_df = df.loc[df['type'] == 'B', :]
i_df = df.loc[df['type'] == 'I', :]
# Convert columns into our intended dtypes
b_df = to_dtypes(b_df, b_dtypes)
i_df = to_dtypes(i_df, i_dtypes)
# Let's convert our 'symbol' column to known categories
b_df = b_df.categorize(columns=['symbol'])
i_df['symbol'] = i_df['symbol'].cat.as_known()
# Is our symbol column known now?
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)
#
## print() returns 'False' for both.
## (Please help...)
#
UPDATE: So it seems that if I shift the 'npartitions' parameters to 1, then print() returns True in both cases. So this appears to be an issue with the partitions containing different categories. However loading both dataframes into only two partitions is not feasible, so is there a way I can tell dask to do some sort of re-sorting to make the categories consistent across partitions?
The answer to your problem is basically contained in the docs. I'm referring to the part of the code commented with "# categorize requires computation, and results in known categoricals". I'll expand on it here because it seems to me you're misusing loc.
import pandas as pd
import dask.dataframe as dd
# Set up our test data
data = [['B', 'IBN', '9.9800'],
['B', 'PAY', '21.5000'],
['I', 'PAY', 'seventeen'],
['I', 'SPY', 'ten']
]
# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')
# Convert into dask
ddf = dd.from_pandas(pdf, npartitions=3)
# Split the dataframe by the 'type' column
# reset_index is not necessary
b_df = ddf[ddf["type"] == "B"].reset_index(drop=True)
i_df = ddf[ddf["type"] == "I"].reset_index(drop=True)
# Convert columns into our intended dtypes
b_df = b_df.categorize(columns=['symbol'])
b_df["price"] = b_df["price"].astype('float64')
i_df = i_df.categorize(columns=['symbol'])
# Is our symbol column known now? YES
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)
