pyspark - Chaining a .orderBy to a .read method - python

Say you have something like the following code:
df = sqlContext.read.parquet('s3://somebucket/some_parquet_file')
How would you chain an orderBy onto that object?
df = df.orderBy(df.some_col)
To make it something like:
df = sqlContext.read.parquet('s3://somebucket/some_parquet_file').orderBy(?.some_col)

You can give the column name as a string or a list of strings:
df = sqlContext.read.parquet('s3://somebucket/some_parquet_file').orderBy("some_col")
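If you need a column expression without a reference to the DataFrame variable, for example to sort in descending order, pyspark.sql.functions.col builds one inside the same chain; a minimal sketch, reusing the asker's placeholder path:

from pyspark.sql import functions as F

# col() builds a column expression by name, so no DataFrame variable is needed
df = (sqlContext.read.parquet('s3://somebucket/some_parquet_file')
      .orderBy(F.col('some_col').desc()))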

Related

Change DataTypes of Pandas Columns by selecting columns by regex

I have a Pandas dataframe with a lot of columns looking like p_d_d_c0, p_d_d_c1, ... p_d_d_g1, p_d_d_g2, ....
df =
a b c p_d_d_c0 p_d_d_c1 p_d_d_c2 ... p_d_d_g0 p_d_d_g1 ...
All these columns, which conform to the regex, need to be selected and their datatypes changed from object to float. In particular, the columns look like p_d_d_c* and p_d_d_g*; they are all object types and I would like to change them to float types. Is there a way to select columns in bulk using a regular expression and change them to float types?
I tried the answer from here, but it takes a lot of time and memory as I have hundreds of these columns.
df[df.filter(regex=("p_d_d_.*"))]
I also tried:
df.select(lambda col: col.startswith('p_d_d_g'), axis=1)
But, it gives an error:
AttributeError: 'DataFrame' object has no attribute 'select'
My Pandas version is 1.0.1
So, how to select columns in bulk and change their data types using regex?
Try this:
import pandas as pd
# sample dataframe
df = pd.DataFrame(data={"co1":[1,2,3,4], "co22":[4,3,2,1], "co3":[2,3,2,4], "abc":[5,4,3,2]})
# select all columns which have co in it
floatcols = [col for col in df.columns if "co" in col]
for floatcol in floatcols:
    df[floatcol] = df[floatcol].astype(float)
From the same link, and with some astype magic.
# boolean mask over columns whose names start with the prefix
column_vals = df.columns.map(lambda x: x.startswith("p_d_d_"))
train_temp = df.loc(axis=1)[column_vals]
train_temp = train_temp.astype(float)
EDIT:
To modify the original dataframe, do something like this:
column_vals = [x for x in df.columns if x.startswith("p_d_d_")]
df[column_vals] = df[column_vals].astype(float)
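If some of those object columns contain values that are not clean numbers, pd.to_numeric with errors="coerce" is a more forgiving cast than astype; a minimal sketch, assuming the same p_d_d_ prefix:

import pandas as pd

cols = df.columns[df.columns.str.startswith("p_d_d_")]
# to_numeric turns unparseable values into NaN instead of raising
df[cols] = df[cols].apply(pd.to_numeric, errors="coerce")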

Python Pandas Group by

I have the below code:
import pandas as pd
Orders = pd.read_excel(r"C:\Users\Bharath Shana\Desktop\Python\Sample.xls", sheet_name='Orders')
Returns = pd.read_excel(r"C:\Users\Bharath Shana\Desktop\Python\Sample.xls", sheet_name='Returns')
Sum_value = pd.DataFrame(Orders['Sales']).sum()
Orders_Year = pd.DatetimeIndex(Orders['Order Date']).year
Orders.merge(Returns, how="inner", on="Order ID")
which gives the output below.
My requirement is to use groupby and see the output as below.
Can someone please help me use groupby in the above code? I would like to see everything in a single line by using groupby.
Regards,
Bharath
You can do this by selecting the columns and defining a new dataframe:
grouped = pd.DataFrame()
groupby = ['Year', 'Segment', 'Sales']
for i in groupby:
    grouped[i] = Orders[i]
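Note that the snippet above only copies columns; if the goal is an aggregated single-line view, pandas' groupby can be applied to the merged frame directly. A minimal sketch, assuming the merged result contains Year, Segment and Sales columns as in the screenshots:

merged = Orders.merge(Returns, how="inner", on="Order ID")
merged['Year'] = pd.DatetimeIndex(merged['Order Date']).year
# one row per Year/Segment combination, with total sales
summary = merged.groupby(['Year', 'Segment'], as_index=False)['Sales'].sum()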

What is the correct way to add a list as a column to a dataframe?

I want to add a list as a new column to a dataframe. I am doing:
df['Intervention'] = interventionList
It gives me
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
I read "Pandas add a series to dataframe column", where the accepted answer is very similar.
I believe one option would be to use:
df.assign(Intervention = interventionList)
or to make a copy of the dataframe:
df2 = df.copy()
You can try something like this:
import pandas as pd
li = [1,2,3,4,5]
li2 = [6,7,8,9,10]
df = pd.DataFrame()
#Using pd.Series to add lists to dataframe
df['col1'] = pd.Series(li)
df['col2'] = pd.Series(li2)
df
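The warning in the question usually means df is itself a slice of another dataframe; assigning through .loc on an explicit copy avoids it. A minimal sketch, reusing the asker's names:

df = df.copy()  # break the link to the parent frame the slice came from
df.loc[:, 'Intervention'] = interventionList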

Renaming 30 dataframes' columns

I have 30 data frames, and each df has a single column. The column names are long and look like this:
df1.columns = ['123.ABC_xyz_1.CB_1.S_01.M_01.Pmax']
df2.columns = ['123.ABC_xyz_1.CB_1.S_01.M_02.Pmax']
..
df30.columns = ['123.ABC_xyz_1.CB_1.S_01.M_30.Pmax']
I want to trim their names so that they finally look like this:
df1.columns = ['M1Pmax']
df2.columns = ['M2Pmax']
..
df30.columns = ['M30Pmax']
I thought of something like this:
df_list = [df1,df2,....,df30]
for i,k in enumerate(df_list):
    df_list[i].columns = [col_name+'_df[i]{}'.format(df_list[i]) for col_name in df_list[i].columns]
However, my code above is not working properly.
How can I do it?
You are trying to use the dataframe itself in the name, which is not going to work; I assume you meant to use the name of the dataframe. You are also not shortening anything in your code, just making it longer. I would suggest something like:
df_list = [df1,df2,....,df30]
for i, k in enumerate(df_list):
    df_list[i].columns = ['M{}'.format(i + 1) + col_name.split(".")[-1] for col_name in df_list[i].columns]
IIUC
l = []
for i in df_list:
    # keep the last two dot-separated parts ("M_30", "Pmax"), join them, and drop the underscore
    i.columns = i.columns.str.split('.').str[-2:].str.join('').str.replace('_','')
    l.append(i)
Why not do it like this?
# List of all dataframes
df_list = [df1,df2,....,df30]
# List of column names for all dataframes
colum_names = [['M1Pmax'],['M2Pmax'],...., ['M30Pmax']]
for i in range(len(df_list)):
    df_list[i].columns = colum_names[i]
Hope this will help you!
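If the M-number should come from the existing column name rather than from the loop order, a regex extraction is a bit more robust; a minimal sketch, assuming every name contains an M_<n> segment as in the examples:

import re

for frame in df_list:
    # pull the digits after "M_" and rebuild the short name, e.g. "M30Pmax"
    frame.columns = ['M{}Pmax'.format(re.search(r'M_(\d+)', c).group(1)) for c in frame.columns]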

Read dataframe in pandas skipping first column to read time series data

The question is quite self-explanatory. Is there any way to read the csv file for time series data while skipping the first column?
I tried this code:
df = pd.read_csv("occupancyrates.csv", delimiter = ',')
df = df[:,1:]
print(df)
But this is throwing an error:
"TypeError: unhashable type: 'slice'"
If you know the name of the column, just do:
df = pd.read_csv("occupancyrates.csv") # no need to use the delimiter = ','
df = df.drop(['your_column_to_drop'], axis=1)
print(df)
df = pd.read_csv("occupancyrates.csv")
df.pop('column_name')
A dataframe is like a dictionary, where column names are keys and the columns are the values. For example:
d = dict(a=1,b=2)
d.pop('a')
Now if you print d, the output will be
{'b': 2}
This is what I did above to remove a column from the data frame. This way you do not need to assign the result back to the dataframe, as in the other answer(s):
df = df.iloc[:, 1:]
nor do you need to specify inplace=True anywhere.
The simplest way to delete the first column should be:
del df[df.columns[0]]
or
df.pop(df.columns[0])
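Since this is time series data, another option is to keep the first column but make it the index at read time; a minimal sketch, assuming that first column holds the timestamps:

import pandas as pd

# index_col=0 turns the first column into the index instead of a data column,
# and parse_dates converts it to datetimes for time series operations
df = pd.read_csv("occupancyrates.csv", index_col=0, parse_dates=True)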
