I am working on a data analytics task in Python.
First, this is the raw data:
I want to get a result like this:
My code is:
df_sellout.groupby("Brand")[:,0:4].sum()
But this doesn't work.
I want to use [:,0:4] because I have another, much larger dataset where I can't write out all the column names.
Can anyone help me please?
Try this:
df_sellout.groupby("Brand")[df_sellout.columns[2:]].sum()
Put indexing by iloc in front of groupby:
df_sellout.iloc[:,0:4].groupby("Brand").sum()
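Both suggestions can be compared on a small frame. This is only a sketch: the data and the column names other than "Brand" are made up for illustration, since the real df_sellout is not shown. Note that slicing with iloc first only works if "Brand" is among the selected columns.

```python
import pandas as pd

# Hypothetical stand-in for df_sellout (only "Brand" comes from the question)
df_sellout = pd.DataFrame({
    "Brand": ["A", "B", "A", "B"],
    "Jan": [1, 2, 3, 4],
    "Feb": [10, 20, 30, 40],
    "Mar": [100, 200, 300, 400],
    "Note": ["x", "y", "z", "w"],
})

# Option 1: slice columns by position first, then group.
# "Brand" must be inside the slice, here it is column 0.
by_position = df_sellout.iloc[:, 0:4].groupby("Brand").sum()

# Option 2: group first, then pick the value columns by position-derived names.
value_cols = list(df_sellout.columns[1:4])
by_names = df_sellout.groupby("Brand")[value_cols].sum()

print(by_position)
print(by_names)
```

Either way, no column name other than the grouping key has to be typed out.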
I want to create a new column (not derived from an existing column) of type MapType(StringType(),StringType()). How do I achieve this?
So far, I was able to write the below code but it's not working.
df = df.withColumn("normalizedvariation",lit(None).cast(MapType(StringType(),StringType())))
I would also like to know the different methods to achieve the same thing. Thank you.
This is one way you can try; newMapColumn here would be the name of the map column.
You can see the output I got below. If that is not what you are looking for, please let me know. Thanks!
Also you will have to import the functions using the below line:
from pyspark.sql.functions import col,lit,create_map
I'm quite new to PySpark and coming from SAS I still don't get how to handle parameters (or Macro Variables in SAS terminology).
I have a date parameter like "202105" and want to add it as a String Column to a Dataframe.
Something like this:
date = 202105
df = df.withColumn("DATE", lit('{date}'))
I think it's quite trivial but so far, I didn't find an exact answer to my problem, maybe it's just too trivial...
Hope you guys can help me out. Best regards
You can use string interpolation, i.e. "{}".format() or an f-string (f'{}').
Example:
df.withColumn("DATE", lit("{0}".format(date)))
df.withColumn("DATE", lit("{}".format(date)))
#or
df.withColumn('DATE', lit(f'{date}'))
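Why the original lit('{date}') attempt falls short can be seen without Spark at all. A minimal sketch in plain Python, using the date value from the question:

```python
date = 202105

# Without the f prefix the braces are not interpolated: this is just text.
literal = '{date}'

# Three equivalent ways to turn the parameter into the string "202105".
formatted = "{}".format(date)
fstring = f"{date}"
as_str = str(date)

print(literal, formatted, fstring, as_str)
```

Any of the last three strings can then be passed to lit(...) as in the lines above.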
I'm trying to run a crosstab on my dataframe called "d_recent" using the following line of code:
pd.crosstab(d_recent['BinnedAge'], d_recent['APBI'])
The output I am getting is this:
|Age Bin|Brachytherapy|EBRT|IORT|
|-------|-------------|----|----|
|51-60|1|1|0|
|71-80|86|62|11|
|61-70|2578|723|276|
|41-50|9386|2049|1188|
|81-90|13860|3257|2449|
|31-40|7725|2078|1628|
|21-30|1958|615|425|
This is wrong. What it should look like is:
|Age Bin|Brachytherapy|EBRT|IORT|
|-------|-------------|----|----|
|21-30|1|1|0|
|31-40|86|62|11|
|41-50|2578|723|276|
|51-60|9386|2049|1188|
|61-70|13860|3257|2449|
|71-80|7725|2078|1628|
|81-90|1958|615|425|
Any idea what is going on here and how I can fix it? I can tell that the order of the rows in the first table is related to the order the specific bins are encountered in my dataframe. I can get the correct output if I sort by age prior to running the crosstab, but this isn't a preferable solution because I need to do this with multiple variables. Thanks!
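One possible approach, sketched under assumptions: the question does not show how BinnedAge was built, so this assumes it comes from a numeric age column (called "Age" here, a hypothetical name) and that the data below is invented for illustration. Binning with pd.cut and explicit labels ties each label to its interval and yields an ordered categorical, so the crosstab rows follow the bin order regardless of the order in which rows appear in the dataframe.

```python
import pandas as pd

# Hypothetical stand-in for d_recent; "Age" is an assumed raw column.
d_recent = pd.DataFrame({
    "Age": [55, 25, 72, 38, 64, 45, 83, 29],
    "APBI": ["EBRT", "IORT", "EBRT", "Brachytherapy",
             "IORT", "EBRT", "Brachytherapy", "EBRT"],
})

edges = [20, 30, 40, 50, 60, 70, 80, 90]
labels = ["21-30", "31-40", "41-50", "51-60", "61-70", "71-80", "81-90"]

# pd.cut uses right-closed intervals by default, e.g. (20, 30] -> "21-30",
# and returns an ordered categorical whose order follows `labels`.
d_recent["BinnedAge"] = pd.cut(d_recent["Age"], bins=edges, labels=labels)

table = pd.crosstab(d_recent["BinnedAge"], d_recent["APBI"])
row_order = list(table.index)
print(table)
```

The same pattern can be repeated for other variables by supplying their own edges and labels, so no sorting step is needed before each crosstab.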
I have dataframes that have the same column names, as follows:
df1=pd.DataFrame({'Group1':['a','b','c','d','e'],'Group2':["f","g","h","i","j"],'Group3':['k','L','m','n',"0"]})
df2=pd.DataFrame({'Group1':[0,0,2,1,0],'Group2':[1,2,0,0,0],'Group3':[0,0,0,1,1]})
For some reason, I want to concatenate these dataframes as follows.
dfnew=pd.concat([df1[["Group1","Group2"]], df2[["Group1","Group2"]]], axis=1)
I want to rename the columns of this new dataframe, so I tried the following.
dfnew.columns={"1","2","3","4"}
I expected the order of the columns would be 1,2,3,4, but the actual result was 4,3,1,2 instead.
I do not know why this happens.
If someone could advise me, I would appreciate it very much.
In addition, I need to concatenate many dataframes for future work.
(i.e. concatenate df1,df2, df3...df1000).
Is there a good way to rename the columns as "1,2,3,4.....1000"? Typing these numbers by hand is a lot of work.
Thank you.
To rename columns you can use this syntax:
dfnew.columns=["1","2","3","4"]
In the future, if you want to rename 1000 columns as you asked, you can do something like this:
dfnew.columns=[str(i) for i in range(1,1001)]
Use square brackets (a list) so that the column order is preserved; curly braces create a set, which is unordered:
dfnew.columns=["1","2","3","4"]
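As a small variation on the answers above, the names can be generated from the actual number of columns instead of a hard-coded upper bound. A minimal sketch using the dataframes from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'Group1': ['a', 'b', 'c', 'd', 'e'],
                    'Group2': ["f", "g", "h", "i", "j"],
                    'Group3': ['k', 'L', 'm', 'n', "0"]})
df2 = pd.DataFrame({'Group1': [0, 0, 2, 1, 0],
                    'Group2': [1, 2, 0, 0, 0],
                    'Group3': [0, 0, 0, 1, 1]})

dfnew = pd.concat([df1[["Group1", "Group2"]], df2[["Group1", "Group2"]]], axis=1)

# A list keeps its order, so column k gets the name str(k).
# shape[1] is the number of columns, so this scales to any width.
dfnew.columns = [str(i) for i in range(1, dfnew.shape[1] + 1)]

new_names = list(dfnew.columns)
print(new_names)
```

Because the list length always matches the column count, the assignment cannot fail with a length mismatch when more dataframes are concatenated.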