I would like to know if it is possible to create a dataframe from two dictionaries.
I get two dictionaries like this:
d = {'MO': ['N-2', 'N-8', 'N-7', 'N-6', 'N-9'], 'MO2': ['N0-6'], 'MO3': ['N-2']}
The result I am looking for is:
ID NUM
0 MO 'N-2', 'N-8', 'N-7', 'N-6', 'N-9'
1 MO2 'N0-6'
2 MO3 'N-2'
When I try to build this myself, the NUM column keeps the list brackets ([]) and I can't remove them:
import pandas

liste_id = list(d.keys())
liste_num = list(d.values())
df = pandas.DataFrame({'ID': liste_id, 'NUM': liste_num})
Join each list of values into a single string before creating the dataframe; that turns the list column into plain text and removes the brackets:
import pandas as pd

pd.DataFrame([(key, ", ".join(value))
              for key, value in d.items()],
             columns=['ID', 'NUM'])
ID NUM
0 MO N-2, N-8, N-7, N-6, N-9
1 MO2 N0-6
2 MO3 N-2
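If you would rather keep the values as lists instead of joined strings, building the dataframe from the dictionary's items also works (a sketch using the same dictionary d as above):
import pandas as pd

d = {'MO': ['N-2', 'N-8', 'N-7', 'N-6', 'N-9'], 'MO2': ['N0-6'], 'MO3': ['N-2']}
# each (key, value) pair becomes one row; NUM keeps the original lists
df = pd.DataFrame(list(d.items()), columns=['ID', 'NUM'])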
I am new to pandas.
I am trying to append to a list the names of the columns whose correlation with "failure" is greater than zero.
Here is my code:
corr_matrix = df_train.corr()
corr_matrix["failure"].sort_values(ascending=False)
useful_features = []
for f in corr_matrix["failure"]:
    if f > 0:
        useful_features.append(df_train.columns)
print(useful_features)
But this is appending all column names to the list
[Index(['id', 'product_code', 'loading', 'attribute_0', 'attribute_1',
'attribute_2', 'attribute_3', 'measurement_0', 'measurement_1',
'measurement_2', 'measurement_3', 'measurement_4', 'measurement_5',
'measurement_6', 'measurement_7', 'measurement_8', 'measurement_9',
'measurement_10', 'measurement_11', 'measurement_12', 'measurement_13',
'measurement_14', 'measurement_15', 'measurement_16', 'measurement_17',
'failure', 'kfold'],
...
(I am not pasting the complete output.)
What I want is
useful_features = ['failure','loading',...,'kfold']
Output of
corr_matrix["failure"].sort_values(ascending=False)
failure 1.000000
loading 0.129089
measurement_17 0.033905
measurement_5 0.018079
measurement_8 0.017119
measurement_7 0.016787
measurement_2 0.015808
measurement_6 0.014791
measurement_0 0.009646
attribute_2 0.006337
measurement_14 0.006211
measurement_12 0.004398
measurement_3 0.003577
measurement_16 0.002237
kfold 0.000130
measurement_10 -0.001515
measurement_13 -0.001831
measurement_15 -0.003544
measurement_9 -0.003587
measurement_11 -0.004801
id -0.007545
measurement_4 -0.010488
measurement_1 -0.010810
attribute_3 -0.019222
Name: failure, dtype: float64
Is there a way to append just the column names?
df_train.columns.values also appends all of the names to the list.
You can use indexing to do this:
print(
corr_matrix.index[corr_matrix["failure"] > 0]
)
This translates to:
1. Get the index from corr_matrix.
2. Evaluate where the "failure" column is > 0.
3. Use that boolean mask to filter the index.
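Since the goal is a plain Python list, you can convert the filtered index directly (a short sketch using the corr_matrix from the question):
# boolean mask of positive correlations, used to select the matching index labels
useful_features = corr_matrix.index[corr_matrix["failure"] > 0].tolist()
print(useful_features)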
Description:
I have a GUI that allows the user to add variables that are displayed in a dataframe. As the variables are added, they are automatically numbered, e.g. 'FIELD_0', 'FIELD_1', etc., and each variable has a value associated with it. The data is actually row-based instead of column-based: the 'FIELD' ids are in column 0 and progress downwards, and the corresponding value is in column 1 of the same row. As shown below:
0 1
0 FIELD_0 HH_5_MILES
1 FIELD_1 POP_5_MILES
The user is able to reorder these values and move them up/down a row. However, it's important that the number ordering remains sequential. So, if the user positions 'FIELD_1' above 'FIELD_0' then it gets re-numbered appropriately. Example:
0 1
0 FIELD_0 POP_5_MILES
1 FIELD_1 HH_5_MILES
Currently, I'm using the below code to perform this adjustment - this same re-numbering occurs with other variable names within the same dataframe.
import pandas

df = pandas.DataFrame({0: ['FIELD_1', 'FIELD_0']})
variable_list = ['FIELD', 'OPERATOR', 'RESULT']
for var in variable_list:
    field_list = ['%s_%s' % (var, _) for _, field_name in enumerate(df[0].isin([var]))]
    field_count = 0
    for _, field_name in enumerate(df.loc[:, 0]):
        if var in field_name:
            df.loc[_, 0] = field_list[field_count]
            field_count += 1
This gets me the result I want, but it seems a bit inelegant. If there is a better way, I'd love to know what it is.
It appears you're looking to overwrite the FIELD values so that they always appear in order starting from 0.
We can filter to only the rows whose value contains the word FIELD (str.contains), then assign a list comprehension of freshly numbered names to those rows.
import pandas as pd
# Modified DF
df = pd.DataFrame({0: ['FIELD_1', 'OTHER_1', 'FIELD_0', 'OTHER_0']})
# Select Where Values are Field
m = df[0].str.contains('FIELD')
# Overwrite field with new values by iterating over the total matches
df.loc[m, 0] = [f'FIELD_{n}' for n in range(m.sum())]
print(df)
df:
0
0 FIELD_0
1 OTHER_1
2 FIELD_1
3 OTHER_0
For multiple variables:
import pandas as pd

# Modified DF
df = pd.DataFrame({0: ['FIELD_1', 'OTHER_1', 'FIELD_0', 'OTHER_0']})

variable_list = ['FIELD', 'OTHER']
for v in variable_list:
    # Select rows where the value contains the variable name
    m = df[0].str.contains(v)
    # Overwrite those rows with new values by iterating over the total matches
    df.loc[m, 0] = [f'{v}_{n}' for n in range(m.sum())]
df:
0
0 FIELD_0
1 OTHER_0
2 FIELD_1
3 OTHER_1
You can use sort_values as below:
def f(x):
    l = x.split('_')[1]
    return int(l)

df.sort_values(0, key=lambda col: [f(k) for k in col]).reset_index(drop=True)
0
0 FIELD_0
1 FIELD_1
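The same idea can be written without the helper function by splitting the whole column at once (a sketch; note that the key argument of sort_values requires pandas 1.1 or newer):
# extract the number after the underscore and sort on it
df.sort_values(0, key=lambda col: col.str.split('_').str[1].astype(int)).reset_index(drop=True)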
I have a dataframe of two columns where one category (area_id) encompasses the other (location_id). How can I get a dictionary of lists where the keys are the "area_id" values and the values are lists of the "location_id" values present in that "area_id"?
Concretely, given the dataframe:
df = pd.DataFrame(data={'area_id': ['area_1', 'area_1', 'area_1', 'area_2', 'area_2', 'area_3'],
                        'location_id': ['loc_a', 'loc_a', 'loc_b', 'loc_c', 'loc_d', 'loc_e']})
area_id location_id
0 area_1 loc_a
1 area_1 loc_a
2 area_1 loc_b
3 area_2 loc_c
4 area_2 loc_d
5 area_3 loc_e
I would like the following dictionary:
{'area_1': ['loc_a', 'loc_b'],
'area_2': ['loc_c', 'loc_d'],
'area_3': ['loc_e']}
The code below is a working solution, but I am wondering if there is a more elegant approach that avoids the for loop:
res = {}
for _area in df['area_id'].unique():
    _locs = list(df[df['area_id'] == _area]['location_id'].unique())
    res[_area] = _locs
Thank you
Use:
df.drop_duplicates().groupby('area_id')['location_id'].agg(list).to_dict()
Output:
{'area_1': ['loc_a', 'loc_b'],
'area_2': ['loc_c', 'loc_d'],
'area_3': ['loc_e']}
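If you prefer to deduplicate inside each group rather than dropping duplicates over the whole frame first, aggregating with unique gives the same result here (a sketch using the df defined above):
df.groupby('area_id')['location_id'].agg(lambda s: list(s.unique())).to_dict()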
The dataframe below has to be converted into the format of "op_df":
ip_df = pd.DataFrame({'class': ['I', 'II', 'III'],
                      'details': [[{'sec': 'A', 'assigned_to': 'tom'}, {'sec': 'B', 'assigned_to': 'sam'}],
                                  [{'sec': 'B', 'assigned_to': 'joe'}],
                                  []]})
ip_df:
class details
0 I [{'sec':'A','assigned_to':'tom'},{'sec':'B','assigned_to':'sam'}]
1 II [{'sec':'B','assigned_to':'joe'}]
2 III []
The required output dataframe is suppose to be,
op_df:
class sec assigned_to
0 I A tom
1 I B sam
2 II B joe
3 III NaN NaN
How can I turn each dictionary in the "details" column into a new row, with the keys of the dictionary as column names and the values of the dictionary as the corresponding column values?
I have tried:
ip_df.join(ip_df['details'].apply(pd.Series))
but I am unable to produce "op_df" with it.
I am sure there are better ways to do it, but I had to deconstruct your details list and create your dataframe as follows:
import pandas as pd

dict_values = {'class': ['I', 'II', 'III'],
               'details': [[{'sec': 'A', 'assigned_to': 'tom'}, {'sec': 'B', 'assigned_to': 'sam'}],
                           [{'sec': 'B', 'assigned_to': 'joe'}],
                           []]}

all_values = []
for cl, detail in zip(dict_values['class'], dict_values['details']):
    if len(detail) > 0:
        # one output row per inner dictionary, carrying the class along
        for innerdict in detail:
            row = {'class': cl}
            for innerkey in innerdict.keys():
                row[innerkey] = innerdict[innerkey]
            all_values.append(row)
    else:
        # keep the class even when there are no details
        row = {'class': cl}
        all_values.append(row)

op_df = pd.DataFrame(all_values)
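A more concise alternative (a sketch, assuming pandas 0.25 or newer for explode): explode the list column so each dictionary gets its own row, replace the NaN produced by the empty list with an empty dict, and expand the dictionaries into columns.
import pandas as pd

ip_df = pd.DataFrame({'class': ['I', 'II', 'III'],
                      'details': [[{'sec': 'A', 'assigned_to': 'tom'}, {'sec': 'B', 'assigned_to': 'sam'}],
                                  [{'sec': 'B', 'assigned_to': 'joe'}],
                                  []]})

# one row per dictionary; the empty list becomes a single NaN row
exploded = ip_df.explode('details').reset_index(drop=True)
# expand each dictionary into columns; the NaN row becomes all-NaN
details = pd.DataFrame([x if isinstance(x, dict) else {} for x in exploded['details']])
op_df = pd.concat([exploded[['class']], details], axis=1)
print(op_df)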
I have a made-up pandas series that I split on a delimiter:
s2 = pd.Series(['2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*'])
split = s2.str.split('*')
The general logic to parse this string:
Asterisks are the delimiter
Numbers immediately before asterisks identify the length of the following block
Three indicators
C indicates field names will follow
N indicates new field values will follow
O indicates old field values will follow
Numbers immediately after indicators (tough because they are next to numbers before asterisks) identify how many field names or values will follow
The parsing logic and code work on a single pandas series, so understanding the parsing itself matters less here than understanding how to apply it to a dataframe.
I calculate the number of fields in the string (in this case, the 3 in the second block which is C316):
number_of_fields = int(split[0][1][1:int(split[0][0])])
I apply a lot of list splitting to extract the results I need into three separate lists (field names, new values, and old values):
i = 2
string_length = int(split[0][1][int(split[0][0]):])
field_names_list = []
while i < number_of_fields + 2:
    field_name = split[0][i][0:string_length]
    field_names_list.append(field_name)
    string_length = int(split[0][i][string_length:])
    i += 1

i = 3 + number_of_fields
string_length = int(split[0][2 + number_of_fields][string_length:])
new_values_list = []
while i < 3 + number_of_fields*2:
    field_name = split[0][i][0:string_length]
    new_values_list.append(field_name)
    string_length = int(split[0][i][string_length:])
    i += 1

i = 4 + number_of_fields*2
string_length = int(split[0][3 + number_of_fields*2][string_length:])
old_values_list = []
while i <= 3 + number_of_fields*3:
    old_value = split[0][i][0:string_length]
    old_values_list.append(old_value)
    if i == 3 + number_of_fields*3:
        string_length = 0
    else:
        string_length = int(split[0][i][string_length:])
    i += 1
I combine the lists into a df with three columns:
df = pd.DataFrame(
{'field_name': field_names_list,
'new_value': new_values_list,
'old_value': old_values_list
})
field_name new_value old_value
0 first_field_name field value
1 second_field_name Y
2 third_field_name hello
How would I apply this same process to a df with multiple strings? The df would look like this:
row_id string
0 24 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*
1 25 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*
I'm unsure how to maintain the row_id with the eventual columns. The end result should look like this:
row_id field_name new_value old_value
0 24 first_field_name field value
1 24 second_field_name Y
2 24 third_field_name hello
3 25 first_field_name field value
4 25 second_field_name Y
5 25 third_field_name hello
I know I can concatenate multiple dataframes, but that would come after maintaining the row_id. How do I keep the row_id with the corresponding values after a series of list slicing operations?
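One possible approach (a sketch, not a definitive implementation): wrap the slicing logic above in a helper, say parse_tokens (a hypothetical name), that takes the token list for one string and returns the three-column dataframe, then tag each parsed frame with its row_id before concatenating.
frames = []
for row_id, string in zip(df['row_id'], df['string']):
    parsed = parse_tokens(string.split('*'))   # parse_tokens wraps the slicing logic shown above
    parsed.insert(0, 'row_id', row_id)         # carry the source row_id onto every parsed row
    frames.append(parsed)

result = pd.concat(frames, ignore_index=True)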