I have the following dataframe (the real one has a lot more columns and rows, so just using this as an example):
{'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
'taste': {0: 1.2, 1: 4.6, 2: 35.4, 3: 0.005},
'smell': {0: 20.0, 1: 23.0, 2: 2.1, 3: 5.3},
'shape': {0: 0.004, 1: 0.2, 2: 0.12, 3: 11.0},
'volume': {0: 23, 1: 23, 2: 23, 3: 23},
'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}}
I'd like to write a function to perform calculations on the dataframe, for specific columns. The calculation is in the code below.
As I'd only want to apply the code to specific columns, I've set up a list of columns, and as there is a pre-defined 'factor' we need to take into account in the calculation, I set this up too:
cols = ['taste', 'smell', 'shape']
factor = 72
def multiply_columns(row):
    return ((row[cols] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)
Then, I apply the function to the dataframe, and I want to overwrite the original column values with the new ones, so I do this:
for cols in df.columns:
    df[cols] = df[cols].apply(multiply_columns)
But I get the following error:
~\AppData\Local\Temp/ipykernel_8544/3939806184.py in multiply_columns(row)
3
4 def multiply_columns(row):
----> 5 return ((row[cols] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)
6
7
TypeError: string indices must be integers
But the values I'm using in the calculation aren't strings:
sample        object
sample id      int64
replicate      int64
taste        float64
smell        float64
shape        float64
volume         int64
weight       float64
dtype: object
The desired output would be:
{'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
'taste': {0: 0.0074, 1: 0.028366667, 2: 0.2183, 3: 3.08333e-05},
'smell': {0: 0.123333333, 1: 0.141833333, 2: 0.01295, 3: 0.032683333},
'shape': {0: 2.46667e-05, 1: 0.001233333, 2: 0.00074, 3: 0.067833333},
'volume': {0: 23, 1: 23, 2: 23, 3: 23},
'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}}
Can anyone kindly show me the errors of my ways?
This has a few issues.
If you wanted to index elements in row, note that the key you're using is a string (the column name) rather than an integer position. To get the integer positions of the columns you're interested in, you could use this:
cols = ['taste', 'smell', 'shape']
cols_idx = [df.columns.get_loc(col) for col in cols]
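These positions could then be used for integer-based selection, e.g. (a hypothetical usage of the cols_idx above, not part of the original answer):
df.iloc[:, cols_idx]  # selects the 'taste', 'smell' and 'shape' columns by position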
However, if I understand your question, you could perform this operation on columns directly with the understanding that the operation will be performed on each row. See a test case that worked for me:
import pandas as pd
df = pd.DataFrame({'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
                   'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
                   'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
                   'taste': {0: 1.2, 1: 4.6, 2: 35.4, 3: 0.005},
                   'smell': {0: 20.0, 1: 23.0, 2: 2.1, 3: 5.3},
                   'shape': {0: 0.004, 1: 0.2, 2: 0.12, 3: 11.0},
                   'volume': {0: 23, 1: 23, 2: 23, 3: 23},
                   'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}})
cols = ['taste', 'smell', 'shape']
factor = 72
for col in cols:
    df[col] = ((df[col] / df['volume']) * (factor * df['volume'] / df['weight']) / 1000)
Note that your line
for cols in df.columns:
told pandas to run this operation on every column: cols was rebound to each column name (a string), so it was no longer your list.
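For illustration (a minimal reproduction added here, not part of the original answer): on the 'sample' column, apply ran the function element-wise, so row was the cell value 'orange' and cols was the string 'sample', which reduces to
'orange'['sample']  # TypeError: string indices must be integers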
You have to pass the column name to the function as well.
cols = ['taste', 'smell', 'shape']
factor = 72
def multiply_columns(row, col):
    return ((row[col] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)

for col in cols:
    df[col] = df.apply(lambda x: multiply_columns(x, col), axis=1)
Also, the output I'm getting is a bit different from your desired output, even though I used the same formula.
   sample  sample id  replicate          taste          smell          shape  volume          weight
0  orange          1          1  0.00720000000  0.12000000000  0.00002400000      23  12.00000000000
1  orange          1          2  0.25476923077  1.27384615385  0.01107692308      23   1.30000000000
2  banana          5          1  1.06200000000  0.06300000000  0.00360000000      23   2.40000000000
3  banana          5          2  0.00011250000  0.11925000000  0.24750000000      23   3.20000000000
I have a dataframe which contains JSON columns; it is quite huge and not very efficient, so I would like to store it as a nested dataframe.
So sample data-frame looks like:
id date ag marks
0 I2213 2022-01-01 13:28:05.448054 [{'type': 'A', 'values': {'X': {'F1': 0.1, 'F2': 0.2}, 'U': {'F1': 0.3, 'F2': 0.4}}}, {'type': 'B', 'results': {'Y': {'F1': 0.3, 'F2': 0.2}}}] [{'type': 'A', 'marks': {'X': 0.5, 'U': 0.7}}, {'type': 'B', 'marks': {'Y': 0.4}}]
1 I2213 2022-01-01 14:28:05.448054 [{'type': 'B', 'values': {'Z': {'F1': 0.4, 'F2': 0.2}}}] [{'type': 'A', 'marks': {'X': 0.4, 'U': 0.6}}, {'type': 'B', 'marks': {'Y': 0.3, 'Z': 0.4}}]
2 I2213 2022-01-03 15:28:05.448054 [{'type': 'A', 'values': {'X': {'F1': 0.2, 'F2': 0.1}}}] [{'type': 'A', 'marks': {'X': 0.2, 'U': 0.9}}, {'type': 'B', 'marks': {'Y': 0.2}}]
Expected output:
grouped by date. Sample code for generating the sample dataframe:
import pandas as pd
from datetime import datetime, timedelta

def sample_data():
    ag_data = [
        "[{'type': 'A', 'values': {'X': {'F1': 0.1, 'F2': 0.2}, 'U': {'F1': 0.3, 'F2': 0.4}}}, {'type': 'B', 'results': {'Y': {'F1': 0.3, 'F2': 0.2}}}]",
        "[{'type': 'B', 'values': {'Z': {'F1': 0.4, 'F2': 0.2}}}]",
        "[{'type': 'A', 'values': {'X': {'F1': 0.2, 'F2': 0.1}}}]",
    ]
    marks_data = [
        "[{'type': 'A', 'marks': {'X': 0.5, 'U': 0.7}}, {'type': 'B', 'marks': {'Y': 0.4}}]",
        "[{'type': 'A', 'marks': {'X': 0.4, 'U': 0.6}}, {'type': 'B', 'marks': {'Y': 0.3, 'Z': 0.4}}]",
        "[{'type': 'A', 'marks': {'X': 0.2, 'U': 0.9}}, {'type': 'B', 'marks': {'Y': 0.2}}]",
    ]
    date_data = [
        datetime.now() - timedelta(3, seconds=7200),
        datetime.now() - timedelta(3, seconds=3600),
        datetime.now() - timedelta(1),
    ]
    df = pd.DataFrame()
    df['date'] = date_data
    df['ag'] = ag_data
    df['marks'] = marks_data
    df['id'] = 'I2213'
    return df
I tried JSON normalization, but it creates the dataframe in a columnar fashion:
d = a['ag'].apply(lambda x: pd.json_normalize(json.loads(x.replace("'", '"'))))
This gives a dataframe with the columns type, values.X.F1, values.X.F2, values.U.F1, values.U.F2, results.Y.F1 and results.Y.F2. The issue is how to put the dict keys like X, Y, F1, F2 as rows instead of columns.
Is it possible to achieve the desired format as shown in the image?
I have tried creating helper functions.
import json
import pandas as pd

def ag_col_helper(ag_df):
    s = pd.json_normalize(json.loads(ag_df.replace("'", '"')))
    s.set_index('type', inplace=True)
    s1 = s.melt(ignore_index=False, var_name='feature')
    split_vals = s1['feature'].str.split(".", n=2, expand=True)
    s1['name'] = split_vals[1]
    s1['feature'] = split_vals[2]
    return s1.groupby(['type', 'name', 'feature']).first().dropna()
def marks_col_helper(marks_df):
    s = pd.json_normalize(json.loads(marks_df.replace("'", '"')))
    s.set_index('type', inplace=True)
    s1 = s.melt(ignore_index=False, var_name='name', value_name='marks')
    split_vals = s1['name'].str.split(".", n=2, expand=True)
    s1['name'] = split_vals[1]
    return s1.groupby(['type', 'name']).first().dropna()
Then these can be applied to the ag and marks columns:
df['ag'] = df['ag'].apply(ag_col_helper)
df['marks'] = df['marks'].apply(marks_col_helper)
After that, for
df.iloc[0]['ag']
we would get:
                   value
type name feature
A    U    F1         0.3
          F2         0.4
     X    F1         0.1
          F2         0.2
B    Y    F1         0.3
          F2         0.2
and for
df.iloc[0]['marks']
we would get:
           marks
type name
A    U       0.7
     X       0.5
B    Y       0.4
I think this is what you are expecting.
For grouping the date column you can create another column using df['Date'] = df['date'].dt.date and perform a groupby.
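A minimal sketch of that last step (the aggregation printed here is only an illustration, using the date column from the sample data above):
df['Date'] = df['date'].dt.date           # calendar date, without the time part
for day, day_df in df.groupby('Date'):    # one sub-frame per day
    print(day, len(day_df))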
It appears that you can set data frames as values within a dataframe. This:
import pandas as pd
#creating outer df
df = pd.DataFrame([{'a':1, 'b':2, 'inner':None},{'a':3, 'b':4, 'inner':None}])
#creating inner dfs
inner_1 = pd.DataFrame([{'time': 0, 'e': 1}, {'time': 1, 'e': 2}])
inner_2 = pd.DataFrame([{'time': 0, 'e': 6}, {'time': 1, 'e': 7}])
inners = [inner_1, inner_2]
df['inner'] = inners
print(df)
results in this:
   a  b    inner
0  1  2    time  e
0     0  1
1     1  2
1  3  4    time  e
0     0  6
1     1  7
The printout quickly gets confusing, but it seems like it's what you want.
For your data specifically, take your lists of dicts and convert them to a dataframe with pd.DataFrame. If you want to turn all your lists into dataframes, you can use something like this:
import pandas as pd
#creating outer df
df = pd.DataFrame([{'a':1, 'b':2, 'inner':None},{'a':3, 'b':4, 'inner':None}])
#creating inner dfs
inner_1 = [{'time': 0, 'e': 1}, {'time': 1, 'e': 2}]
inner_2 = [{'time': 0, 'e': 6}, {'time': 1, 'e': 7}]
inners = [inner_1, inner_2]
df['inner'] = inners
print('un-transformed')
print(df)
#transforming all lists into DFs
for i in range(df.shape[0]):                        # iterate over rows
    for j in range(df.shape[1]):                    # iterate over columns
        if isinstance(df.iat[i, j], list):          # filter cells that are lists
            df.iat[i, j] = pd.DataFrame(df.iat[i, j])  # convert to a dataframe
print("transformed")
print(df)
which returns
un-transformed
   a  b                                        inner
0  1  2  [{'time': 0, 'e': 1}, {'time': 1, 'e': 2}]
1  3  4  [{'time': 0, 'e': 6}, {'time': 1, 'e': 7}]
transformed
   a  b    inner
0  1  2    time  e
0     0  1
1     1  2
1  3  4    time  e
0     0  6
1     1  7
df = pd.DataFrame({('Quarter', 'Range'): {0: 'A', 1: 'B'}, ('Q1(0.25)', 'Low'): {0: 0, 1: 0}, ('Q1(0.25)', 'High'): {0: 10, 1: 630}, ('Q2(0.5)', 'Low'): {0: 10, 1: 630}, ('Q2(0.5)', 'High'): {0: 50, 1: 3000}, ('Q3(0.75)', 'Low'): {0: 50, 1: 3000}, ('Q3(0.75)', 'High'): {0: 100, 1: 8500}, ('Q4(1.0)', 'Low'): {0: 100, 1: 8500}, ('Q4(1.0)', 'High'): {0: 'np.inf', 1: 'np.inf'}})
Given the above dataframe: if the value for A is between 0 and 10, replace it with 0.25; if it is between 10 and 50, replace it with 0.5; and similarly for all the other ranges and for B.
Expected output:
If the value is 12 for A and 3210 for B, then:
df2 = pd.DataFrame({'Column': {0: 'A', 1: 'B'}, 'Prob': {0: 0.5, 1: 0.75}})
How to do it?
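One possible approach (a sketch added for illustration, not from the original post; the helper lookup_prob and the [Low, High) interval convention are assumptions) is to scan each quarter's Low/High pair for the interval that contains the value:
import numpy as np
import pandas as pd

# Map each quarter's column label to the probability it represents.
quarter_probs = {'Q1(0.25)': 0.25, 'Q2(0.5)': 0.5, 'Q3(0.75)': 0.75, 'Q4(1.0)': 1.0}

def lookup_prob(bounds, range_label, value):
    # Pick the row whose ('Quarter', 'Range') label matches, e.g. 'A' or 'B'.
    row = bounds.loc[bounds[('Quarter', 'Range')] == range_label].iloc[0]
    for quarter, prob in quarter_probs.items():
        low = float(row[(quarter, 'Low')])
        high_raw = row[(quarter, 'High')]
        # The sample data stores infinity as the string 'np.inf'.
        high = np.inf if high_raw == 'np.inf' else float(high_raw)
        if low <= value < high:
            return prob
    return None

pairs = [('A', 12), ('B', 3210)]
df2 = pd.DataFrame({'Column': [r for r, _ in pairs],
                    'Prob': [lookup_prob(df, r, v) for r, v in pairs]})
# df2 matches the expected output: 0.5 for A, 0.75 for B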
Here is my data:
{'SystemID': {0: '95EE8B57',
1: '5F891F03',
2: '5F891F03',
3: '5F891F03'},
'Day': {0: '06/08/2018', 1: '05/08/2018', 2: '04/08/2018', 3: '05/08/2018'},
'AlarmClass-S': {0: 4, 1: 2, 2: 4, 3: 0},
'AlarmClass-ELM': {0: 0, 1: 0, 2: 0, 3: 2}}
I would like to perform an aggregation and filtering which in SQL would be formulated as
SELECT SystemID, COUNT(*) as count FROM table GROUP BY SystemID HAVING COUNT(*) > 2
Thus the result shall be
{'SystemID': {0: '5F891F03'},
'count': {0: '3'}}
How to do this in pandas?
You can use groupby and count, then filter at the end.
(df.groupby('SystemID', as_index=False)
   .agg(count=('SystemID', 'count'))
   .query('count > 2'))

   SystemID  count
0  5F891F03      3
(df.groupby('SystemID', as_index=False)
   .agg(count=('SystemID', 'count'))
   .query('count > 2')
   .to_dict())
# {'SystemID': {0: '5F891F03'}, 'count': {0: 3}}
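An equivalent spelling (a variant added here for illustration, not from the original answer) goes through value_counts:
counts = df['SystemID'].value_counts()   # Series mapping each SystemID to its row count
counts[counts > 2].rename_axis('SystemID').reset_index(name='count')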
Here's my starting dataframe:
StartDF = pd.DataFrame({'A': {0: 1, 1: 1, 2: 2, 3: 4, 4: 5, 5: 5, 6: 5, 7: 5}, 'B': {0: 2, 1: 2, 2: 4, 3: 2, 4: 2, 5: 4, 6: 4, 7: 5}, 'C': {0: 10, 1: 1000, 2: 250, 3: 100, 4: 550, 5: 100, 6: 3000, 7: 250}})
I need to create a list of individual dataframes based on duplicate values in columns A and B, so it should look like this:
df1 = pd.DataFrame({'A': {0: 1, 1: 1}, 'B': {0: 2, 1: 2}, 'C': {0: 10, 1: 1000}})
df2 = pd.DataFrame({'A': {0: 2}, 'B': {0: 4}, 'C': {0: 250}})
df3 = pd.DataFrame({'A': {0: 4}, 'B': {0: 2}, 'C': {0: 100}})
df4 = pd.DataFrame({'A': {0: 5}, 'B': {0: 2}, 'C': {0: 550}})
df5 = pd.DataFrame({'A': {0: 5, 1: 5}, 'B': {0: 4, 1: 4}, 'C': {0: 100, 1: 3000}})
df6 = pd.DataFrame({'A': {0: 5}, 'B': {0: 5}, 'C': {0: 250}})
I've seen a lot of answers that explain how to DROP duplicates, but I need to keep the duplicate values because the information in column C will usually be different between rows regardless of duplicates in columns A and B. All of the row data needs to be preserved in the new dataframes.
Additional note: the starting dataframe (StartDF) will change in length, so each time this is run, the number of individual dataframes created will vary. Ultimately, I need to print the newly created dataframes to their own csv files (I know how to do this part). I just need to know how to break the data out of the original dataframe in an elegant way.
You can use a groupby, iterate over each group and build a list using a list comprehension.
df_list = [g for _, g in df.groupby(['A', 'B'])]
print(*df_list, sep='\n\n')
   A  B     C
0  1  2    10
1  1  2  1000

   A  B    C
2  2  4  250

   A  B    C
3  4  2  100

   A  B    C
4  5  2  550

   A  B     C
5  5  4   100
6  5  4  3000

   A  B    C
7  5  5  250
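Since the stated end goal is one CSV per group, a minimal sketch of that last step (the file-naming convention is just an assumption):
for (a, b), group in df.groupby(['A', 'B']):
    group.to_csv(f'group_A{a}_B{b}.csv', index=False)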