Say you have something like the following code:
df = sqlContext.read.parquet('s3://somebucket/some_parquet_file')
How would you chain an orderBy onto that object?
df = df.orderBy(df.some_col)
To make it something like:
df = sqlContext.read.parquet('s3://somebucket/some_parquet_file').orderBy(?.some_col)
You can give the column name as a string or a list of strings:
df = sqlContext.read.parquet('s3://somebucket/some_parquet_file').orderBy("some_col")
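If you need a column expression rather than just a name (for instance, to sort in descending order), pyspark.sql.functions.col fits the same chained style. A small sketch along those lines:
from pyspark.sql.functions import col

# col("some_col") builds a column expression without referencing the
# dataframe variable, so a sort direction can be chained in the same one-liner
df = (sqlContext.read.parquet('s3://somebucket/some_parquet_file')
      .orderBy(col("some_col").desc()))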
I have a Pandas dataframe with a lot of columns looking like p_d_d_c0, p_d_d_c1, ... p_d_d_g1, p_d_d_g2, ....
df =
a b c p_d_d_c0 p_d_d_c1 p_d_d_c2 ... p_d_d_g0 p_d_d_g1 ...
All these columns, which conform to the regex, need to be selected and their datatypes changed from object to float. In particular, the columns look like p_d_d_c* and p_d_d_g*; they are all object types, and I would like to change them to float. Is there a way to select columns in bulk using a regular expression and change them to float types?
I tried the answer from here, but it takes a lot of time and memory as I have hundreds of these columns.
df[df.filter(regex=("p_d_d_.*")).columns]
I also tried:
df.select(lambda col: col.startswith('p_d_d_g'), axis=1)
But, it gives an error:
AttributeError: 'DataFrame' object has no attribute 'select'
My Pandas version is 1.0.1
So, how do I select columns in bulk and change their data types using a regex?
Try this:
import pandas as pd
# sample dataframe
df = pd.DataFrame(data={"co1":[1,2,3,4], "co22":[4,3,2,1], "co3":[2,3,2,4], "abc":[5,4,3,2]})
# select all columns which have co in it
floatcols = [col for col in df.columns if "co" in col]
for floatcol in floatcols:
    df[floatcol] = df[floatcol].astype(float)
From the same link, and with some astype magic.
column_vals = df.columns.map(lambda x: x.startswith("p_d_d_"))
train_temp = df.loc(axis=1)[column_vals]
train_temp = train_temp.astype(float)
EDIT:
To modify the original dataframe, do something like this:
column_vals = [x for x in df.columns if x.startswith("p_d_d_")]
df[column_vals] = df[column_vals].astype(float)
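A more direct variant, as a sketch along the same lines: let filter do the regex matching and cast all matching columns in one vectorized call.
# columns of the matching sub-frame, then a single cast
cols = df.filter(regex=r"^p_d_d_").columns
df[cols] = df[cols].astype(float)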
I've the below code
import pandas as pd
Orders = pd.read_excel(r"C:\Users\Bharath Shana\Desktop\Python\Sample.xls", sheet_name='Orders')
Returns = pd.read_excel(r"C:\Users\Bharath Shana\Desktop\Python\Sample.xls", sheet_name='Returns')
Sum_value = pd.DataFrame(Orders['Sales']).sum()
Orders_Year = pd.DatetimeIndex(Orders['Order Date']).year
Orders.merge(Returns, how="inner", on="Order ID")
which gives the output as below
My requirement is that I would like to use groupby and see the output as below.
Can someone please help me with how to use groupby in the code above? I would like to see everything in a single line by using groupby.
Regards,
Bharath
You can do this by selecting the columns and assigning them to a new dataframe:
grouped = pd.DataFrame()
groupby = ['Year','Segment','Sales']
for i in groupby:
    grouped[i] = Orders[i]
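For the actual aggregation, here is a hedged sketch building on the code in the question (the column names 'Order Date', 'Year', 'Segment' and 'Sales' are taken from the question and the list above; adjust them to your sheet):
# merge, derive the year, then aggregate in one pass
merged = Orders.merge(Returns, how="inner", on="Order ID")
merged['Year'] = pd.DatetimeIndex(merged['Order Date']).year
# one row per Year/Segment combination with the summed sales
result = merged.groupby(['Year', 'Segment'], as_index=False)['Sales'].sum()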
I want to add a list as a new column to a dataframe. I am doing:
df['Intervention'] = interventionList
It gives me
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
I read Pandas add a series to dataframe column where the accepted answer is very similar.
I believe one option would be to use:
df.assign(Intervention = interventionList)
or to make a copy of the dataframe:
df2 = df.copy()
You can try something like this:
import pandas as pd
li = [1,2,3,4,5]
li2 =[6,7,8,9,10]
df = pd.DataFrame()
#Using pd.Series to add lists to dataframe
df['col1'] = pd.Series(li)
df['col2'] = pd.Series(li2)
df
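One thing worth knowing about this approach: assigning a Series aligns on the index, so if the list is shorter than the dataframe, the missing positions become NaN instead of raising an error. A quick illustration:
df = pd.DataFrame({'a': [1, 2, 3]})
df['b'] = pd.Series([10, 20])  # shorter series: row 2 gets NaN
print(df)
#    a     b
# 0  1  10.0
# 1  2  20.0
# 2  3   NaN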
I have 30 data frames and each df has a column. The column names are long and look something like the following:
df1.columns = ['123.ABC_xyz_1.CB_1.S_01.M_01.Pmax']
df2.columns = ['123.ABC_xyz_1.CB_1.S_01.M_02.Pmax']
..
df30.columns = ['123.ABC_xyz_1.CB_1.S_01.M_30.Pmax']
I want to trim their names so that they end up looking like this:
df1.columns = ['M1Pmax']
df2.columns = ['M2Pmax']
..
df30.columns = ['M30Pmax']
I thought of something like this:
df_list = [df1,df2,....,df30]
for i,k in enumerate(df_list):
    df_list[i].columns = [col_name+'_df[i]{}'.format(df_list[i]) for col_name in df_list[i].columns]
However, my above code is not working properly.
How to do it?
You are trying to use the dataframe itself in the name, which is not going to work. I am assuming you meant to use the name of the dataframe. Note also that your code doesn't shorten anything; it makes the names longer. I would suggest something like:
df_list = [df1,df2,....,df30]
for i, k in enumerate(df_list):
    df_list[i].columns = ['M{}'.format(i + 1) + col_name.split(".")[-1] for col_name in df_list[i].columns]
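A quick check of the renaming rule against one of the names from the question:
name = '123.ABC_xyz_1.CB_1.S_01.M_01.Pmax'
print('M{}'.format(1) + name.split(".")[-1])  # M1Pmax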
IIUC
l=[]
for i in df_list:
    i.columns = i.columns.str.split('.').str[-2:].str.join('').str.replace('_', '')
    l.append(i)
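To see what that string chain does to one of the question's column names:
import pandas as pd

cols = pd.Index(['123.ABC_xyz_1.CB_1.S_01.M_30.Pmax'])
# split on '.', keep the last two pieces, join them, drop the underscores
print(cols.str.split('.').str[-2:].str.join('').str.replace('_', ''))
# Index(['M30Pmax'], dtype='object')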
Why not do it like this?
# List of all dataframes
df_list = [df1,df2,....,df30]
# List of column names for all dataframes
column_names = [['M1Pmax'],['M2Pmax'],...., ['M30Pmax']]
for i in range(len(df_list)):
    df_list[i].columns = column_names[i]
Hope this helps!
The question is quite self-explanatory: is there any way to read a CSV file of time series data while skipping the first column?
I tried this code:
df = pd.read_csv("occupancyrates.csv", delimiter = ',')
df = df[:,1:]
print(df)
But this is throwing an error:
"TypeError: unhashable type: 'slice'"
If you know the name of the column, just do:
df = pd.read_csv("occupancyrates.csv") # no need to use the delimiter = ','
df = df.drop(['your_column_to_drop'], axis=1)
print(df)
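If you would rather never load the column at all, read_csv's usecols parameter accepts a callable over the column names. A sketch, where 'timestamp' is a hypothetical name for the first column:
# keep every column except the (hypothetically named) first one
df = pd.read_csv("occupancyrates.csv",
                 usecols=lambda name: name != "timestamp")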
df = pd.read_csv("occupancyrates.csv")
df.pop('column_name')
A dataframe is like a dictionary, where column names are keys and the columns are the values. For example:
d = dict(a=1,b=2)
d.pop('a')
Now if you print d, the output will be
{'b': 2}
This is what was done above to remove a column from the dataframe.
Done this way, you don't need to assign the result back to the dataframe as in the other answer(s):
df = df.iloc[:, 1:]
This way you don't even need to specify inplace=True anywhere.
The simplest way to delete the first column should be:
del df[df.columns[0]]
or
df.pop(df.columns[0])
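Note that pop also returns the removed column as a Series, which is handy if you still need it. A quick illustration with made-up data:
import pandas as pd

df = pd.DataFrame({'ts': ['2020-01-01', '2020-01-02'], 'occ': [0.7, 0.8]})
first = df.pop(df.columns[0])  # removes the column in place and returns it
print(first.tolist())          # ['2020-01-01', '2020-01-02']
print(df.columns.tolist())     # ['occ']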