How to check if a serial number starts with 0 - python

I'm excluding rows from my df that meet certain conditions:
df[~((df['Wood_type'] == 'pine') & (df['wood_size'] == 20))]
I would also like to exclude the numbers that start with 0 in the column 'Serial':
df[~((df['Wood_type'] == 'pine') & (df['wood_size'] == 20) & (df['Serial'] == range(0) == 0))]
I tried the above, but it returns no result.

You will probably want to use df['Serial'].str.startswith to check the first character:
df[ ~((df['Wood_type'] == 'pine') & (df['wood_size'] == 20) & (df['Serial'].str.startswith('0')))]
The current expression df['Serial'] == range(0) == 0 is meaningless. Python chains the comparisons, so it is equivalent to (df['Serial'] == range(0)) and (range(0) == 0). Clearly, neither of those is related to comparing the first character of the string to '0' (as opposed to the integer 0).
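As a quick demonstration, here is a minimal, self-contained sketch of the fix (the sample data is invented here):
import pandas as pd

df = pd.DataFrame({
    'Wood_type': ['pine', 'oak', 'pine'],
    'wood_size': [20, 20, 20],
    'Serial': ['0123', '4567', '0890'],  # must be strings for .str to work
})
# Drop pine rows of size 20 whose serial starts with '0'
mask = (df['Wood_type'] == 'pine') & (df['wood_size'] == 20) & df['Serial'].str.startswith('0')
print(df[~mask])
If the Serial column is numeric, convert it first with df['Serial'].astype(str).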


Loop through a few commands using a function

I have been trying to turn the following repeated statements into a loop, but I constantly get an error message.
print(((df1['col1_df'] == 0) & (df1['col2_df'] == True)).sum())
print(((df1['col1_df'] == 0) & (df1['col3_df'] == True)).sum())
print(((df1['col1_df'] == 0) & (df1['col4_df'] == True)).sum())
print(((df1['col1_df'] == 0) & (df1['col5_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col2_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col3_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col4_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col5_df'] == True)).sum())
I want to loop them through a function.
So far I have:
for i in range(2, 5):
    col = "col{}_df".format(i)
    print(((df['col'] == 0) & (df['col'] == 2)).sum())
How can I also number the df, so that the loop runs through df1, df2, df3, ...?
col is a variable; 'col' is a string literal. df['col'] looks up a column literally named 'col', so it does not refer to the variable col; you want df[col].
Your string formatting itself is fine: col = "col{}_df".format(i) produces "col2_df" and so on.
Also, range(2,5) will give you [2,5), not [2,5]. The end point is not inclusive, so to cover columns 2 through 5 you need range(2,6).
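Putting those fixes together, a minimal corrected version of the loop might look like this (assuming the dataframes are named df1 and df2, as in the question):
for df in (df1, df2):  # iterate over the dataframes themselves
    for i in range(2, 6):  # columns 2 through 5
        col = "col{}_df".format(i)
        print(((df['col1_df'] == 0) & (df[col] == True)).sum())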
You can express your entire code in two lines using a comprehension:
print(*( ((df1['col1_df'] == 0) & (df1[f'col{i}_df'] == True)).sum() for i in range(2, 6) ), sep="\n")
print(*( ((df2['col1_df'] == 0) & (df2[f'col{i}_df'] == True)).sum() for i in range(2, 6) ), sep="\n")
The expression ((df1['col1_df'] == 0) & (df1[f'col{i}_df'] == True)).sum() for i in range(2, 6) creates a generator object yielding the four sums one by one.
The * unpacks this generator and passes its elements to print as if you had written a comma-separated argument list. The sep="\n" ensures that each of these arguments is separated by a new line in the output.

Is there a way to make a value non-negative in a dataframe

I'm new to coding and I'm using python pandas to practice making an algo-trading bot. This is my code.
for date in BTCtest.index:
    if BTCtest.loc[date, 'Shares'] == 0:
        BTCtest.loc[date, 'Shares'] = max(0, -5)
    if BTCtest.loc[date, 'MA10'] > BTCtest.loc[date, 'MA50']:
        BTCtest.loc[date, 'Shares'] = 1
    elif BTCtest.loc[date, 'MA10'] < BTCtest.loc[date, 'MA50']:
        BTCtest.loc[date, 'Shares'] = -1

BTCtest['Position'] = BTCtest['Shares'].cumsum()
BTCtest['Close1'] = BTCtest['Close'].shift(-1)
BTCtest['Profit'] = [BTCtest.loc[date, 'Close1'] - BTCtest.loc[date, 'Close'] if BTCtest.loc[date, 'Shares'] == 1 else 0 for date in BTCtest.index]
BTCtest['Profit'].plot()
print(BTCtest)
plt.axhline(y=0, color='red')
I'm trying not to add shares when the position is 0.
I tried
if BTCtest.loc[date, 'Shares'] == 0:
    BTCtest.loc[date, 'Shares'] = 0
and
if BTCtest.loc[date, 'Shares'] == 0:
    max(BTCtest.loc[date, 'Shares'], -1)
Below is the result so far.
[plot of the Profit column, omitted]
I don't want my position to go below 0.
I don't understand your code, but I understand your title.
To convert a negative value into a positive one, multiply it by -1 when it is less than 0. So, write:
numberr = int(input("Enter a number: "))
if numberr < 0:
    numberr *= -1
print(numberr)
Hope this helps.
You should use the .apply function instead of iterating over the index. It will really simplify your code, and it is the better practice.
Additionally, if you want to operate only on the rows where the position is zero, do this:
BTCtest[BTCtest['Position'] == 0].apply(myfunction, axis=1)
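For the title question itself (making values in a column non-negative), pandas has vectorized methods, so no loop is needed. A minimal sketch (the column name is taken from the post; adapt as needed):
# Replace every negative value with 0, element-wise
BTCtest['Shares'] = BTCtest['Shares'].clip(lower=0)
# Or flip the sign of negative values instead, if that is the intent
BTCtest['Shares'] = BTCtest['Shares'].abs()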

How to insert the data that's causing an error, into a separate file, while processing the rest of the column?

I'm making a program that handles over 20 million rows and over 50 columns of data. I'm trying to check if the numbers in one of the columns are even or odd.
If even, insert 'E' into a different column; if odd, insert 'O' into the column.
DF_FILE_IN = pd.read_csv('3MB_2.txt', chunksize=1000, sep='\t', dtype=str,
                         engine='c', header=0, encoding='latin-1')
out_fields = ['HSNBR', 'OEFLAG']
for DF_FILE in DF_FILE_IN:
    df_out1 = pd.DataFrame(dtype='str', columns=out_fields)
    df_out1['HSNBR'] = DF_FILE['ANumber'].map(lambda x: f'{x:0>6}')
    df_out1.loc[pd.to_numeric(df_out1['HSNBR']).map(lambda x: (x % 2 == 0) & (x != 0)), 'OEFLAG'] = 'E'
    df_out1.loc[pd.to_numeric(df_out1['HSNBR']).map(lambda x: (x % 2 != 0) & (x != 0)), 'OEFLAG'] = 'O'
But some data has letters, symbols, spaces, etc.
When I run it, an error pops up from this line of code:
df_out1.loc[pd.to_numeric(df_out1['HSNBR']).map(lambda x: (x % 2 == 0) & (x != 0)), 'OEFLAG'] = 'E'
and says (example):
ValueError: Unable to parse string "111 1/2g" at position 10
I'm using chunking to pull in the data (e.g. 1 million rows at a time). I want to put the rows that cause the errors into a separate file, but when I use try/except, the whole column in that chunk goes unprocessed.
How do I get the data and errors into a file, while letting the program keep processing the column?
What @BernardL means is to write a function like:
def even_odd(x):
    x = str(x)
    if x.isnumeric():
        x = int(x)
        if (x % 2 == 0) and (x != 0):
            return 'E'
        if (x % 2 != 0) and (x != 0):
            return 'O'
    return 'error'
And then apply it with:
df_out1['OEFLAG'] = df_out1['HSNBR'].map(even_odd)
Then you can take out the errors with:
df_out1[df_out1['OEFLAG'] == 'error'].to_csv('errors_file.csv')
df_out1 = df_out1[df_out1['OEFLAG'] != 'error']
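A vectorized alternative sketch: pd.to_numeric(..., errors='coerce') turns unparseable strings into NaN, which can serve as the error mask (column and file names as in the post):
nums = pd.to_numeric(df_out1['HSNBR'], errors='coerce')
# Rows that failed to parse go to the error file
df_out1[nums.isna()].to_csv('errors_file.csv', index=False)
# Flag the remaining rows as even or odd
good = nums.notna() & (nums != 0)
df_out1.loc[good & (nums % 2 == 0), 'OEFLAG'] = 'E'
df_out1.loc[good & (nums % 2 != 0), 'OEFLAG'] = 'O'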

How to form another column in a pd.DataFrame out of different variables

I'm trying to make a new boolean variable from an if-statement with multiple conditions on other variables, but so far none of my many attempts works, even with a single variable as the parameter.
[screenshot of the head of the used columns in the data frame, omitted]
I would really appreciate it if anyone can spot the problem. I have already searched the whole web for two days, but as a beginner I couldn't find the solution yet.
amount = df4['AnzZahlungIDAD']
time = df4['DLZ_SCHDATSCHL']
Erstr = df4['Schadenwert']
Zahlges = df4['zahlgesbrut']
timequantil = time.quantile(.2)
diff = (Erstr-Zahlges)/Erstr*100
diffrange = [(diff <=15) & (diff >= -15)]
special = df4[['Taxatoreneinsatz', 'Belegpruefereinsatz_rel', 'IntSVKZ', 'ExtTechSVKZ']]
First Method with list comprehension
label = []
label = [True if (amount[i] <= 1) & (time[i] <= timequantil) & (diff == diffrange) & (special == 'N') else False for i in label]
label
Second Method with iterrows()
df4['label'] = pd.Series([])
df4['label'] = [True if (row[amount] <= 1) & (row[time] <= timequantil) & (row[diff] == diffrange) & (row[special] == 'N') else False for row in df4.iterrows()]
df4['label']
3rd Method with Lambda function
df4.loc[:,'label'] = '1'
df4['label'] = df4['label'].apply([lambda c: True if (c[amount] <= 1) & (c[time] <= timequantil) & (c[diff] == diffrange) & (c[special]) == 'N' else False for c in df4['label']], axis = 0)
df4['label'].value_counts()
I expected to get a variable "label" in my dataframe df4 that is either True or False.
Earlier tries gave me only all values False or all True, even when I used only a single parameter, which is impossible given the data.
The first method runs fine but outputs: []
The second method gives me the following error: TypeError: tuple indices must be integers or slices, not Series
The third method does not finish at all.
IIUC, try this
time = df4['DLZ_SCHDATSCHL']
Erstr = df4['Schadenwert']
Zahlges = df4['zahlgesbrut']
# timequantil = time.quantile(.2)
diff = (Erstr-Zahlges)/Erstr*100
df4['label'] = (
    (df4['AnzZahlungIDAD'] <= 1)
    & (time <= time.quantile(.2))
    & (diff <= 15) & (diff >= -15)
    & (df4['Belegpruefereinsatz_rel'] == 'N')
    & (df4['Taxatoreneinsatz'] == 'N')
    & (df4['ExtTechSVKZ'] == 'N')
    & (df4['IntSVKZ'] == 'N')
)
Given your dataset, I got the following output:
Anz dlz sch zal taxa bel int ext label
0 2 82 200 253.80 N N N J False
1 2 82 200 253.80 N N N J False
2 1 153 200 323.68 N J N N False
3 1 153 200 323.68 N J N N False
4 1 191 500 1252.12 N J N N False
Note: Don't mind the abbreviations used in column name
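As a side note, the pair of range checks can be collapsed with Series.between, which may read more clearly (a sketch using the diff series defined above):
# Equivalent to (diff <= 15) & (diff >= -15); bounds are inclusive by default
in_range = diff.between(-15, 15)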

Speed up Pandas: find all columns which fulfill a set of conditions

I have data represented using pandas DataFrame, which for example looks as follows:
| id | entity | name | value | location
where id is an integer, entity is an integer, name is a string, value is an integer, and location is a string (for example US, CA, UK, etc.).
Now, I want to add a new column to this data frame, column "flag", where values are assigned as follows:
for _, d in df.iterrows():
    if d.entity == 10 and d.value != 1000 and d.location == "CA":
        d.flag = "A"
    elif d.entity != 10 and d.entity != 0 and d.value == 1000 and d.location == "US":
        d.flag = "C"
    elif d.entity == 0 and d.value == 1000 and d.location == "US":
        d.flag = "B"
    else:
        print("Different case")
Is there a way to speed this up and use some built in functions instead of the for loop?
Use np.select, to which you pass a list of conditions; based on those conditions you give it a list of choices, and you can specify a default value for when none of the conditions is met. Note that the conditions are built on df itself, not on a row d:
import numpy as np

conditions = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US'),
]
choices = ["A", "C", "B"]
df['flag'] = np.select(conditions, choices, default="Different case")
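A minimal, self-contained run of the same pattern (the sample rows here are invented for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'entity': [10, 0, 5, 10],
    'value': [500, 1000, 1000, 1000],
    'location': ['CA', 'US', 'US', 'CA'],
})
conditions = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US'),
]
df['flag'] = np.select(conditions, ["A", "C", "B"], default="Different case")
print(df['flag'].tolist())  # ['A', 'B', 'C', 'Different case']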
Add parentheses around each comparison and combine them with the bitwise and operator & so the boolean masks work with numpy.select:
m = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US'),
]
df['flag'] = np.select(m, ["A", "C", "B"], default="Different case")
You wrote "find all columns which fulfill a set of conditions", but your code shows you're actually trying to add a new column whose value for each row is computed from the values of other columns of the same row.
If that's indeed the case, you can use df.apply, giving it a function that computes the value for a specific row:
def flag_value(row):
    if row.entity == 10 and row.value != 1000 and row.location == "CA":
        return "A"
    elif row.entity != 10 and row.entity != 0 and row.value == 1000 and row.location == "US":
        return "C"
    elif row.entity == 0 and row.value == 1000 and row.location == "US":
        return "B"
    else:
        return "Different case"

df['flag'] = df.apply(flag_value, axis=1)
Take a look at this related question for more information.
If you truly want to find all the rows which satisfy some condition, the usual way with a Pandas dataframe is df.loc with boolean indexing (note the parentheses; & binds more tightly than the comparisons):
only_a_cases = df.loc[(df.entity == 10) & (df.value != 1000) & (df.location == "CA")]
# or:
only_a_cases = df.loc[lambda df: (df.entity == 10) & (df.value != 1000) & (df.location == "CA")]
