I am trying to pivot some data in Python pandas package by using the pivot_table feature but as part of this I have a specific, bespoke order that I want to see my columns returned in - determined by a Sort_Order field which is already in the dataframe. So for test example with:
raw_data = {'Support_Reason' : ['LD', 'Mental Health', 'LD', 'Mental Health', 'LD', 'Physical', 'LD'],
'Setting' : ['Nursing', 'Nursing', 'Residential', 'Residential', 'Community', 'Prison', 'Residential'],
'Setting_Order' : [1, 1, 2, 2, 3, 4, 2],
'Patient_ID' : [6789, 1234, 4567, 5678, 7890, 1235, 3456]}
Data = pd.DataFrame(raw_data, columns = ['Support_Reason', 'Setting', 'Setting_Order', 'Patient_ID'])
Data
Then pivot:
pivot = pd.pivot_table(Data, values='Patient_ID', index=['Support_Reason'],
columns=['Setting'], aggfunc='count',dropna = False)
pivot = pivot.reset_index()
pivot
This is exactly how I want my table to look except that the columns have defaulted to A-Z ordering. I would like them to be ordered Ascending as per the Setting_Order column - so that would be order of Nursing, Residential, Community then Prison. Is there some additional syntax that I could add to my pd.pivot_table code would make this possible please?
I realise there are a few different work-arounds for this, the simplest being re-ordering the columns afterwards(!) but I want to avoid having to hard-code column names as these will change over time (both the headings and their order) and the Setting and Setting_Order fields will be managed in a separate reference table. So any form of answer that will avoid having to list Settings in code would be ideal really.
Try:
ordered = df.sort_values("Setting_Order")["Setting"].drop_duplicates().tolist()
pivot = pivot[list(pivot.columns.difference(ordered))+ordered]
col_order = list(Data.sort_values('Setting_Order')['Setting'].unique())
pivot[col_order+['Support_Reason']]
Does this help?
Related
Situation:
1. all_task_usage_10_19
all_task_usage_10_19 is the file which consists of 29229472 rows × 20 columns.
There are multiple rows with the same ID inside the column machine_ID with different values in other columns.
Columns:
'start_time_of_the_measurement_period','end_time_of_the_measurement_period', 'job_ID', 'task_index','machine_ID', 'mean_CPU_usage_rate','canonical_memory_usage', 'assigned_memory_usage','unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage', 'maximum_memory_usage','mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage','maximum_disk_IO_time', 'cycles_per_instruction_(CPI)', 'memory_accesses_per_instruction_(MAI)', 'sample_portion',
'aggregation_type', 'sampled_CPU_usage'
2. clustering code
I am trying to cluster multiple machine_ID records using the following code, referencing: How to combine multiple rows into a single row with pandas
3. Output
Output displayed using: with option_context as it allows to better visualise the content
My Aim:
I am trying to cluster multiple rows with the same machine_ID into a single record, so I can apply algorithms like Moving averages, LSTM and HW for predicting cloud workloads.
Something like this.
Maybe a Multi-Index is what you're looking for?
df.set_index(['machine_ID', df.index])
Note that by default set_index returns a new dataframe, and does not change the original.
To change the original (and return None) you can pass an argument inplace=True.
Example:
df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
'a': [1, 2, 3, 4, 5],
'b': [10, 20, 30, 40, 50]})
new_df = df.set_index(['machine_ID', df.index]) # not in-place
df.set_index(['machine_ID', df.index], inplace=True) # in-place
For me, it does create a multi-index: first level is 'machine_ID', second one is the previous range index:
The below code worked for me:
all_task_usage_10_19.groupby('machine_ID')[['start_time_of_the_measurement_period','end_time_of_the_measurement_period','job_ID', 'task_index','mean_CPU_usage_rate', 'canonical_memory_usage',
'assigned_memory_usage', 'unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage', 'maximum_memory_usage',
'mean_disk_I/O_time', 'mean_local_disk_space_used','maximum_CPU_usage',
'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
'memory_accesses_per_instruction_(MAI)', 'sample_portion',
'aggregation_type', 'sampled_CPU_usage']].agg(list).reset_index()
I have a not too large DF. I want to add a column that looks up the value in the column of that specific row. So in the example below, the value should come from the column names 'PA1.13'
example = {'Honda Civic': [1],
'Toyota': [0],
'valuetolookup': ['Honda Civic'],
'Result should be': [1]
}
As you can see the column has two levels. I cannot seem to find how to make a second column level from scratch, but here I hope that I can work it out if someone wants to use my example code to solve it :-)
You can use a simple apply() to extract data like you want:
import pandas as pd
example = {'Honda Civic': [1,3],
'Toyota': [0,2],
'valuetolookup': ['Honda Civic','Toyota'],
'Result should be': [1,2]
}
df = pd.DataFrame(example)
#In the pandas apply, i use the "valuetolookup" column value to get the column name
df["Result"] = df.apply(lambda x : x[x["valuetolookup"]],axis=1)
I added another row to show you that you can use different columns to lookup :)
I have two data frames, let's say A and B. A has the columns ['Name', 'Age', 'Mobile_number'] and B has the columns ['Cell_number', 'Blood_Group', 'Location'], with 'Mobile_number' and 'Cell_number' having common values. I want to join the 'Location' column only onto A based off the common values in 'Mobile_number' and 'Cell_number', so the final DataFrame would have A={'Name':,'Age':,'Mobile_number':,'Location':]
a = {'Name': ['Jake', 'Paul', 'Logan', 'King'], 'Age': [33,43,22,45], 'Mobile_number':[332,554,234, 832]}
A = pd.DataFrame(a)
b = {'Cell_number': [832,554,123,333], 'Blood_group': ['O', 'A', 'AB', 'AB'], 'Location': ['TX', 'AZ', 'MO', 'MN']}
B = pd.DataFrame(b)
Please suggest. A colleague suggest to use pd.Join but I don't understand how.
Thank you for your time.
the way i see it, you want to merge a dataframe with a part of another dataframe, based on some common column.
first you have to make sure the common column share the same name:
B['Mobile_number'] = B['Cell_number']
then create a dataframe that contains only the relevant columns (the indexing column and the relevant data column):
B1 = B[['Mobile_number', 'Location']]
and at last you can merge them:
merged_df = pd.merge(A, B1, on='Mobile_number')
note that this usage of pd.merge will take only rows with Mobile_number value that exists in both dataframes.
you can look at the documentation of pd.merge to change how exactly the merge is done, what to include etc..
I'm having an issue with the output of my spark dataframe. The file can range from few GB to 50+GB
SparkDF = spark.read.format("csv").options(header="true", delimiter="|", maxColumns="100000").load(my_file.csv)
This give me the correct DF that I want. But as per requirement I need to have as key the column name and all the values in a set related to that key.
For example:
df = {'col1': ['1', '2', '3', '4'], 'col2': ['Jean', 'Cecil', 'Annie', 'Maurice'], 'col3': ['test', 'aaa', 'bbb', 'ccc','ddd']}
df = pd.DataFrame(data=d)
Should give me at the end:
{'col1': {'1', '2', '3', '4'},'col2': {'Jean', 'Cecil', 'Annie', 'Maurice'},'col3': {'test', 'aaa', 'bbb', 'ccc','ddd'}
I've implemented the following:
def columnDict(dataFrame):
colDict = dict(zip(dataFrame.schema.names, zip(*dataFrame.collect())))
return colDict if colDict else dict.fromkeys(dataFrame.schema.names, ())
However, it returned me a dict with a tuple as value and not a set as I require.
I would like either to convert the tuple in the dictionary into a set or just directly get a dictionary as a set as an output of my function.
EDIT:
For the full requirements:
Beside the dictionary mentioned above, there is another one that contains similar data for checking.
Means that the file that I load to a spark DF and transform into a dictionary contains data that must be checked against the other dictionary.
The goal is to check every key from my dict (the loaded file), against the check dictionary, first to see if they exist, then if it exist to check if the values of the keys are a subset of the check values.
If I load the check data in a dataframe it would look like this : (note that I may not be able to change the fact that it's a dict, I will see if I can modify from dict to spark df)
df = {'KeyName': ['col1', 'col2', 'col3'], 'ValueName': ['1, 2, 3, 4', 'Jean, Cecil, Annie, Maurice, Annie, Maurice', 'test, aaa, bbb, ccc,ddd,eee']}
df = pd.DataFrame(data=df)
print(df)
KeyName ValueName
0 col1 1, 2, 3, 4
1 col2 Jean, Cecil, Annie, Maurice, Annie, Maurice
2 col3 test, aaa, bbb, ccc,ddd,eee
So at the end, the data in my file should be a subset of a row that have the same KeyName as my dict.
I'm slightly stuck with legacy code and I'm little bit struggling to migrate it to spark databricks.
EDIT 2:
hopefully this will work. I uploaded the 2 files with modified data:
https://filebin.net/1rnnvqn2b0ww7qc8
FakeData.csv contains the data that I loaded on my side with the above code and must be a subset of the second one.
FakeDataChecker.csv contains the data that is the actual full set available
EDIT 3:
Forgot to add that all empty string in the FakeData should not be taken in account as well as the one in FakeDataChecker
So I'm not sure I have understood your usecase perfectly. But let's try with a first draft.
From what I'm understanding, you have a first file with all your data. And a file checker with keys that you need to have in the data foreach column. And additional keys present in the data should be filtered out.
This could be done with inner join between your initial data and the data checker. If there aren't too many keys in the data checker, Spark should automatically broadcast the data checker dataframe for optimized joins.
Here is the first draft of the code, this isn't yet completely automated waiting for your first questions and remarks.
First let's import the needed functions and the data:
from pyspark.sql.functions import col
from pyspark.sql import Window
spark.sql("set spark.sql.caseSensitive=true")
data = (
spark
.read
.format("csv")
.options(header=True, delimiter="|", maxColumns="100000")
.load("FakeData.csv")
.na.drop()
)
data_checker = (
spark
.read
.format("csv")
.options(header=True, delimiter="|", maxColumns="100000")
.load("FakeDataChecker.csv")
.na.drop(subset=["ValueName"])
)
We drop null values as you need, you can specify the wanted columns with the subset keyword
Then let's prepare the join dataframes
data_checker_date = data_checker.filter(col("KeyName") == "DATE").select(col("ValueName").alias("date"))
data_checker_location = data_checker.filter(col("KeyName") == "LOCATION").select(col("ValueName").alias("location"))
data_checker_location_id = data_checker.filter(col("KeyName") == "LOCATIONID").select(col("ValueName").alias("locationid"))
data_checker_type = data_checker.filter(col("KeyName") == "TYPE").select(col("ValueName").alias("type"))
We need to alias the column during the joins to avoid duplicated column names. And we specify the case sensitive option for when we drop the columns, so that we don't drop the initial ones in CAPS.
Finally we filter out, through inner join all keys not present in the data checker:
(
data
.join(data_checker_date, data.DATE == data_checker_date.date)
.join(data_checker_location, data.LOCATION == data_checker_location.location)
.join(data_checker_location_id, data.LOCATIONID == data_checker_location_id.locationid)
.join(data_checker_type, data.TYPE == data_checker_type.type)
.drop("date", "location", "locationid", "type")
.show()
)
In next steps, we can automate this through retrieving the distinct keyNames of the columns (e.g.: "DATE", "LOCATION", etc...) So we can don't have to copy paste the code 4 times or X times in the future.
Something in the line of:
from pyspark.sql.functions import collect_set
distinct_keynames = data_checker.select(collect_set('KeyName').alias('KeyName')).first()['KeyName']
for keyname in distinct_keynames:
etc... implement the logic of chaining joins
I am creating a dataframe from a CSV file. I have gone through the docs, multiple SO posts, links as I have just started Pandas but didn't get it. The CSV file has multiple columns with same names say a.
So after forming dataframe and when I do df['a'] which value will it return? It does not return all values.
Also only one of the values will have a string rest will be None. How can I get that column?
the relevant parameter is mangle_dupe_cols
from the docs
mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
by default, all of your 'a' columns get named 'a.0'...'a.N' as specified above.
if you used mangle_dupe_cols=False, importing this csv would produce an error.
you can get all of your columns with
df.filter(like='a')
demonstration
from StringIO import StringIO
import pandas as pd
txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""
df = pd.read_csv(StringIO(txt), skipinitialspace=True)
df
df.filter(like='a')
I had a similar issue, not due to reading from csv, but I had multiple df columns with the same name (in my case 'id'). I solved it by taking df.columns and resetting the column names using a list.
In : df.columns
Out:
Index(['success', 'created', 'id', 'errors', 'id'], dtype='object')
In : df.columns = ['success', 'created', 'id1', 'errors', 'id2']
In : df.columns
Out:
Index(['success', 'created', 'id1', 'errors', 'id2'], dtype='object')
From here, I was able to call 'id1' or 'id2' to get just the column I wanted.
That's what I usually do with my genes expression dataset, where the same gene name can occur more than once because of a slightly different genetic sequence of the same gene:
create a list of the duplicated columns in my dataframe (refers to column names which appear more than once):
duplicated_columns_list = []
list_of_all_columns = list(df.columns)
for column in list_of_all_columns:
if list_of_all_columns.count(column) > 1 and not column in duplicated_columns_list:
duplicated_columns_list.append(column)
duplicated_columns_list
Use the function .index() that helps me to find the first element that is duplicated on each iteration and underscore it:
for column in duplicated_columns_list:
list_of_all_columns[list_of_all_columns.index(column)] = column + '_1'
list_of_all_columns[list_of_all_columns.index(column)] = column + '_2'
This for loop helps me to underscore all of the duplicated columns and now every column has a distinct name.
This specific code is relevant for columns that appear exactly 2 times, but it can be modified for columns that appear even more than 2 times in your dataframe.
Finally, rename your columns with the underscored elements:
df.columns = list_of_all_columns
That's it, I hope it helps :)
Similarly to JDenman6 (and related to your question), I had two df columns with the same name (named 'id').
Hence, calling
df['id']
returns 2 columns.
You can use
df.iloc[:,ind]
where ind corresponds to the index of the column according how they are ordered in the df. You can find the indices using:
indices = [i for i,x in enumerate(df.columns) if x == 'id']
where you replace 'id' with the name of the column you are searching for.