I have an Access query that I want to convert into a Python script:
SELECT
[Functional_Details].Customer_No,
Sum([Functional_Details].[SUM(Incoming_Hours)]) AS [SumOfSUM(Incoming_Hours)],
Sum([Functional_Details].[SUM(Incoming_Minutes)]) AS [SumOfSUM(Incoming_Minutes)],
Sum([Functional_Details].[SUM(Incoming_Seconds)]) AS [SumOfSUM(Incoming_Seconds)],
[Functional_Details].Rate,
[Functional_Details].Customer_Type
FROM [Functional_Details]
WHERE(
(([Functional_Details].User_ID) Not In ("IND"))
AND
(([Functional_Details].Incoming_ID)="Airtel")
AND
(([Functional_Details].Incoming_Category)="Foreign")
AND
(([Functional_Details].Outgoing_ID)="Airtel")
AND
(([Functional_Details].Outgoing_Category)="Foreign")
AND
(([Functional_Details].Current_Operation)="NO")
AND
(([Functional_Details].Active)="NO")
)
GROUP BY [Functional_Details].Customer_No, [Functional_Details].Rate, [Functional_Details].Customer_Type
HAVING ((([Functional_Details].Customer_Type)="Check"));
I have Functional_Details stored in a dataframe, df_functional_details.
I do not understand how to proceed with the Python script.
So far I have tried:
df_fd_temp=df_functional_details.copy()
if(df_fd_temp['User_ID'] != 'IND'
and df_fd_temp['Incoming_ID'] == 'Airtel'
and df_fd_temp['Incoming_Category'] == 'Foreign'
and df_fd_temp['Outgoing_ID'] == 'Airtel'
and df_fd_temp['Outgoing_Category'] == 'Foreign'
and df_fd_temp['Current_Operation'] == 'NO'
and df_fd_temp['Active'] == 'NO'):
df_fd_temp.groupby(['Customer_No','Rate','Customer_Type']).groups
df_fd_temp[df_fd_temp['Customer_Type'].str.contains("Check")]
First, select the rows where the conditions apply (note the parentheses and & instead of and):
df_fd_temp = df_fd_temp[(df_fd_temp['User_ID'] != 'IND') &
(df_fd_temp['Incoming_ID'] == 'Airtel') &
(df_fd_temp['Incoming_Category'] == 'Foreign') &
(df_fd_temp['Outgoing_ID'] == 'Airtel') &
(df_fd_temp['Outgoing_Category'] == 'Foreign') &
(df_fd_temp['Current_Operation'] == 'NO') &
(df_fd_temp['Active'] == 'NO')]
Then, do the group-by logic:
df_grouped = df_fd_temp.groupby(['Customer_No','Rate','Customer_Type'])
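You now have a groupby object, which you can further manipulate and filter. Note that "Check" in x['Customer_Type'] would test the index labels of the Series rather than its values, so compare the values instead:
df_grouped.filter(lambda x: (x['Customer_Type'] == 'Check').any())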
You might need to tweak the group filtering based on what your actual dataset looks like.
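Putting the steps together, a minimal sketch of the whole query might look like the following. It assumes the dataframe columns carry the same names as in the Access query (including the literal SUM(Incoming_Hours) style names) and uses a small made-up frame in place of df_functional_details. Because Customer_Type is one of the grouping keys, the HAVING condition can be applied as an ordinary row filter before grouping.

```python
import pandas as pd

# Hypothetical sample data standing in for df_functional_details
df_functional_details = pd.DataFrame({
    "Customer_No": [1, 1, 2, 3],
    "User_ID": ["USA", "USA", "IND", "UK"],
    "Incoming_ID": ["Airtel"] * 4,
    "Incoming_Category": ["Foreign"] * 4,
    "Outgoing_ID": ["Airtel"] * 4,
    "Outgoing_Category": ["Foreign"] * 4,
    "Current_Operation": ["NO"] * 4,
    "Active": ["NO"] * 4,
    "Rate": [0.5, 0.5, 0.5, 0.7],
    "Customer_Type": ["Check", "Check", "Check", "Cash"],
    "SUM(Incoming_Hours)": [1, 2, 3, 4],
    "SUM(Incoming_Minutes)": [10, 20, 30, 40],
    "SUM(Incoming_Seconds)": [5, 6, 7, 8],
})
df = df_functional_details

# WHERE clause, plus the HAVING condition on the grouping key Customer_Type
mask = (
    (df["User_ID"] != "IND")
    & (df["Incoming_ID"] == "Airtel")
    & (df["Incoming_Category"] == "Foreign")
    & (df["Outgoing_ID"] == "Airtel")
    & (df["Outgoing_Category"] == "Foreign")
    & (df["Current_Operation"] == "NO")
    & (df["Active"] == "NO")
    & (df["Customer_Type"] == "Check")
)

# GROUP BY the three key columns and SUM the three duration columns
sum_cols = ["SUM(Incoming_Hours)", "SUM(Incoming_Minutes)", "SUM(Incoming_Seconds)"]
result = (
    df[mask]
    .groupby(["Customer_No", "Rate", "Customer_Type"], as_index=False)[sum_cols]
    .sum()
)
print(result)
```

If the SumOfSUM(...) output names from Access are needed, a rename on result would cover that.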
Further reading:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.filter.html
I am trying to add an if condition inside F.when on a PySpark column.
My code:
df = df.withColumn("column_fruits",F.when(F.col('column_fruits') == "Berries"
if("fruit_color")== "red":
return "cherries"
elif("fruit_color") == "pink":
return "strawberries"
else:
return "balackberries").otherwise("column_fruits")
I want to first filter out berries and change fruit names according to color. And all the remaining fruits remain the same.
Can anyone tell me if this is a valid way of writing withColumn code?
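This would work. Each color gets its own when branch, a final when catches the remaining berries, and otherwise keeps every other fruit unchanged:
is_berry = F.col('column_fruits') == "Berries"
df.withColumn("column_fruits", F.when(is_berry & (F.col('fruit_color') == "red"), "cherries")\
.when(is_berry & (F.col('fruit_color') == "pink"), "strawberries")\
.when(is_berry, "blackberries")\
.otherwise(F.col('column_fruits')))\
.show()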
I would like to print out the rows from Excel where either the data exists or does not under a specific column. Whenever I run the code, I get this:
Series([], dtype: int64)
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:15: FutureWarning:
Automatic reindexing on DataFrame vs Series comparisons is deprecated and will
raise ValueError in a future version. Do `left, right = left.align(right,
axis=1, copy=False)` before e.g. `left == right`
My snippet is:
at5 = input("Erkély igen?: ")  # Hungarian prompt: "Balcony, yes?"
if at5 == 'igen':  # 'igen' means 'yes'
    erkely = tables2[~tables2['balcony'].isnull()]
else:
    erkely = tables2[~tables2['balcony'].notnull()]
#bt = tables2[(tables2['lakas_tipus'] ==at1) & (tables2['nm2'] >= at2) &
(tables2['nm2'] < at3 ) & (tables2['room'] == at4 ) & (tables2['balcony'] == erkely
)]
Any idea how to approach this problem? I'm not getting the output I want.
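One possible approach, sketched under assumptions: the warning appears to come from comparing the column tables2['balcony'] with erkely, which at that point is a whole DataFrame rather than a value. Building a boolean mask for the balcony condition and combining it with the other masks avoids that comparison. The sample data and the values of at1 to at4 below are made up for illustration, and the interactive input call is replaced by a fixed answer.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel data; only the column names come from the question
tables2 = pd.DataFrame({
    "lakas_tipus": ["tegla", "tegla", "panel", "tegla"],
    "nm2": [50, 60, 55, 58],
    "room": [2, 2, 2, 2],
    "balcony": [4.0, np.nan, 3.0, np.nan],
})

# Assumed user answers (the original reads them with input())
at1, at2, at3, at4 = "tegla", 40, 70, 2
at5 = "igen"  # "yes"

# Boolean Series: True where the balcony cell is filled (or empty, if the answer is not "igen")
if at5 == "igen":
    balcony_mask = tables2["balcony"].notnull()
else:
    balcony_mask = tables2["balcony"].isnull()

# Combine all conditions element-wise; every operand is a boolean Series
bt = tables2[
    (tables2["lakas_tipus"] == at1)
    & (tables2["nm2"] >= at2)
    & (tables2["nm2"] < at3)
    & (tables2["room"] == at4)
    & balcony_mask
]
print(bt)
```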
import pandas as pd
import numpy as np
df = pd.read_csv("adult.data.csv")
print("data shape: "+str(data.shape))
print("number of rows: "+str(data.shape[0]))
print("number of cols: "+str(data.shape[1]))
print(data.columns.values)
datahist = {}
for index, row in data.iterrows():
    k = (str(row['age']) + str(row['sex']) +
         str(row['workclass']) + str(row['education']) +
         str(row['marital-status']) + str(row['race']))
    if k in datahist:
        datahist[k] += 1
    else:
        datahist[k] = 1
uniquerows = 0
for key, value in datahist.items():
    if value == 1:
        uniquerows += 1
print(uniquerows)
for key, value in datahist.items():
    if value == 1:
        print(key)
df.loc[data['age'] == 58] & df.loc[data['sex'] == Male]
I have been trying to get the above code to work.
I have limited experience in coding but it seems like the issue lies with some of the columns being objects. The int64 columns work just fine when it comes to filtering.
Any assistance will be much appreciated!
df.loc[data['age'] == 58] & df.loc[data['sex'] == Male]
First, you are attempting to use Male as a variable; you probably meant the string 'Male'. Second, note the placement of [ and ]: you are extracting the part of the DataFrame with age equal to 58, then the part with sex equal to Male, and then trying to apply a bitwise and to those two pieces. You should use & on the conditions rather than on pieces of the DataFrame, that is:
df.loc[(data['age'] == 58) & (data['sex'] == 'Male')]
The int64 columns work just fine because you've specified the condition correctly as:
data['age'] == 58
However, the object column condition data['sex'] == Male should be specified as a string:
data['sex'] == 'Male'
Also, I noticed that you have loaded the dataframe df = pd.read_csv("adult.data.csv"). Do you mean this instead?
data = pd.read_csv("adult.data.csv")
The query at the end includes 2 conditions, and each one should be enclosed in parentheses inside the square-bracket [ ] filter. If the dataframe name is data (instead of df), it should be:
data.loc[(data['age'] == 58) & (data['sex'] == 'Male')]
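As a small illustration of combining the two conditions, the sketch below uses a made-up three-row frame in place of adult.data.csv; only the column names age and sex are taken from the question.

```python
import pandas as pd

# Hypothetical sample standing in for the contents of adult.data.csv
data = pd.DataFrame({
    "age": [58, 58, 30],
    "sex": ["Male", "Female", "Male"],
})

# Each comparison yields a boolean Series; & combines them element-wise
mask = (data["age"] == 58) & (data["sex"] == "Male")
selected = data.loc[mask]
print(selected)
```

If the string comparison unexpectedly matches nothing on the real file, it may be worth checking the values for stray whitespace, for example with data['sex'].str.strip().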
Every month I get a dataframe, and every month I have to make the same adjustments to it. I would like to create a function that I can apply to each dataframe without writing the code again.
For the first dataframe, called enero, I have:
for i in range(0,len(enero)):
    if enero.loc[i,"VENDEDOR_CLIENTE"] == "ARTURO":
        enero.loc[i,"MARCA"]="MAQUILA PINTUCO"
    elif enero.loc[i,"PROVEEDOR"] == "PEPITO" and enero.loc[i,"VENDEDOR_CLIENTE"] != "ARTURO":
        enero.loc[i,"MARCA"]="PINTURAS"
For the second dataframe, called febrero:
for i in range(0,len(febrero)):
    if febrero.loc[i,"VENDEDOR_CLIENTE"] == "ARTURO":
        febrero.loc[i,"MARCA"]="MAQUILA PINTUCO"
    elif febrero.loc[i,"PROVEEDOR"] == "PEPITO" and febrero.loc[i,"VENDEDOR_CLIENTE"] != "ARTURO":
        febrero.loc[i,"MARCA"]="PINTURAS"
So, as not to repeat the code every month, I would like to create a function:
def ajustemarca(df,VENDEDOR_CLIENTE,MARCA,PROVEEDOR):
    for i in range(0,len(df)):
        if df.loc[i,"VENDEDOR_CLIENTE"] == "ARTURO":
            df.loc[i,"MARCA"]="MAQUILA PINTUCO"
        elif df.loc[i,"PROVEEDOR"] == "PEPITO" and df.loc[i,"VENDEDOR_CLIENTE"] != "ARTURO":
            df.loc[i,"MARCA"]="PINTURAS"
    return df.loc[i,"MARCA"]
Then, I am calling the function:
enero.apply(ajustemarca)
febrero.apply(ajustemarca)
But it does not work. How can I write this function?
I am sharing an answer that someone posted here but that was later deleted :(
The answer was perfect and now the code works:
def ajustemarca(df):
    for i in range(0,len(df)):
        if df.loc[i,"VENDEDOR_CLIENTE"] == "ARTURO":
            df.loc[i,"MARCA"]="MAQUILA PINTUCO"
        elif df.loc[i,"PROVEEDOR"] == "PEPITO." and df.loc[i,"VENDEDOR_CLIENTE"] != "ARTURO":
            df.loc[i,"MARCA"]="PINTURAS"
ajustemarca(enero)
ajustemarca(febrero)
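A loop-free variant is also possible. The sketch below is an alternative, not the deleted answer's method: it uses boolean masks with .loc, the function name ajustemarca_vectorized is hypothetical, and the sample data is invented. Unlike the range(len(df)) loop, it does not rely on the index running from 0 to n-1.

```python
import pandas as pd

def ajustemarca_vectorized(df):
    """Set MARCA in place using boolean masks instead of a row-by-row loop."""
    es_arturo = df["VENDEDOR_CLIENTE"] == "ARTURO"
    # Supplier value as written in the question; adjust if the real data differs
    es_pepito = df["PROVEEDOR"] == "PEPITO"
    df.loc[es_arturo, "MARCA"] = "MAQUILA PINTUCO"
    df.loc[es_pepito & ~es_arturo, "MARCA"] = "PINTURAS"
    return df

# Hypothetical monthly dataframe
enero = pd.DataFrame({
    "VENDEDOR_CLIENTE": ["ARTURO", "LUIS", "ANA"],
    "PROVEEDOR": ["PEPITO", "PEPITO", "OTRO"],
    "MARCA": ["X", "Y", "Z"],
})
ajustemarca_vectorized(enero)
print(enero["MARCA"].tolist())
```

The same call can then be repeated for febrero and every later month.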
I know there are several hundred solutions to this, but I was wondering if there is a smarter way to fill in the missing values of the Age column in a pandas dataframe based on a lengthy set of conditions, as follows.
mean_value = df[(df["Survived"]== 1) & (df["Pclass"] == 1) & (df["Sex"] == "male")
& (df["Embarked"] == "C") & (df["SibSp"] == 0) & (df["Parch"] == 0)].Age.mean().round(2)
df = df.assign(
Age=np.where(df.Survived.eq(1) & df.Pclass.eq(1) & df.Sex.eq("male") & df.Embarked.eq("C") &
df.SibSp.eq(0) & df.Parch.eq(0) & df.Age.isnull(), mean_value, df.Age)
)
Repeating this for all 6 columns above, with every combination of categories, is too long and bulky. Is there a smarter way to do this?
@Ben.T's answer:
If I understood your method correctly, is this the "verbose version" of it?
for a in np.unique(df.Survived):
    for b in np.unique(df.Pclass):
        for c in np.unique(df.Sex):
            for d in np.unique(df.SibSp):
                for e in np.unique(df.Parch):
                    for f in np.unique(df.Embarked):
                        mean_value = df[(df["Survived"] == a) & (df["Pclass"] == b) & (df["Sex"] == c)
                                        & (df["SibSp"] == d) & (df["Parch"] == e) & (df["Embarked"] == f)].Age.mean()
                        df = df.assign(Age=np.where(df.Survived.eq(a) & df.Pclass.eq(b) & df.Sex.eq(c) & df.SibSp.eq(d) &
                                                    df.Parch.eq(e) & df.Embarked.eq(f) & df.Age.isnull(), mean_value, df.Age))
Which is equivalent to this?
l_col = ['Survived','Pclass','Sex','Embarked','SibSp','Parch']
df['Age'] = df['Age'].fillna(df.groupby(l_col)['Age'].transform('mean'))
You can create a variable that combines all of your criteria, and then you can use the ampersand to add more criteria later.
Note, in the seaborn titanic dataset, where I got the data from, the column names are lowercase.
criteria = ((df["survived"]== 1) &
(df["pclass"] == 1) &
(df["sex"] == "male") &
(df["embarked"] == "C") &
(df["sibsp"] == 0) &
(df["parch"] == 0))
fillin = df.loc[criteria, 'age'].mean()
df.loc[criteria & (df['age'].isnull()), 'age'] = fillin
I guess groupby.transform can do it. For each row, it computes the mean over the group defined by all the columns in the groupby, and it does this for all possible combinations at once. Then fillna with the resulting Series fills each missing value with the mean of the group that has the same characteristics.
l_col = ['Survived','Pclass','Sex','Embarked','SibSp','Parch']
df['Age'] = df['Age'].fillna(df.groupby(l_col)['Age'].transform('mean'))
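To see what this does, here is a small sketch on invented data with only two grouping columns instead of six; the values are made up and only the column names follow the question.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic data
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2],
    "Sex": ["male", "male", "male", "female", "female"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan],
})

l_col = ["Pclass", "Sex"]
# transform returns a Series aligned with df: each row holds its own group's mean Age
group_mean = df.groupby(l_col)["Age"].transform("mean")
# Only the missing ages are replaced; existing values are left untouched
df["Age"] = df["Age"].fillna(group_mean)
print(df["Age"].tolist())
```

A group in which every Age is missing has no mean to fall back on, so its rows stay NaN and may need a second, coarser fill.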