I have a df such as below (3 rows for example):
ID | Dollar_Value
C 45.32
E 5.21
V 121.32
When I view the df in my notebook (just evaluating df), it shows Dollar_Value as
ID | Dollar_Value
C 8.493000e+01
E 2.720000e+01
V 1.720000e+01
instead of the regular format. But when I try to filter the df for a specific ID, it shows the values as they are supposed to be (82.23 or 2.45):
df[df['ID'] == 'E']
ID | Dollar_Value
E 45.32
Is there something I have to do formatting-wise, so the df itself displays the value column as it's supposed to?
Thanks!
You can try running this code before printing, since your columns may contain very large or very small numbers (check with df.describe()):
pd.set_option('display.float_format', lambda x: '%.3f' % x)
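A minimal sketch of the effect on a frame like the one above (sample values from the question; pd.reset_option restores the default afterwards):
import pandas as pd

df = pd.DataFrame({'ID': ['C', 'E', 'V'],
                   'Dollar_Value': [45.32, 5.21, 121.32]})

pd.set_option('display.float_format', lambda x: '%.3f' % x)
print(df)   # Dollar_Value now displays as 45.320, 5.210, 121.320

# to go back to the default float formatting later:
pd.reset_option('display.float_format')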
Below is the dataframe generated using Python and transferred to a CSV file. The number of delimiters, i.e. (|), is 9, as shown below:
Date|ID|CD|BIN|INTRNL|PCC|IND|CENTRE|TRANS|ENTITY
20221231|APPLE|10004050|BCH_dummy|3505|N|Y|Y|6310|
20221231|APPLE|10004050|BCH_MOTOR|3502|N|Y|Y|6310|
Dataframe:
Date ID CD BIN INTRNL PCC IND CENTRE TRANS ENTITY
20221231 APPLE 10004050 BCH_dummy 3505 N Y Y 6310
20221231 APPLE 10004050 BCH_MOTOR 3502 N Y Y 6310
But I want to add an extra column name on the left side of the Date column and maintain the same number of delimiters (|), which is 9, as shown below.
Expected Output in CSV file:
BDR2|Date|ID|CD|BIN|INTRNL|PCC|IND|CENTRE|TRANS|ENTITY
20221231|APPLE|10004050|BCH_dummy|3505|N|Y|Y|6310|
20221231|APPLE|10004050|BCH_MOTOR|3502|N|Y|Y|6310|
df.insert(0, column="BDR2", value='')
df = df.shift(-1, axis = 1)
df.replace("nan",'',inplace=True)
df.to_csv(r"C:\INPUT\df_sample_test.csv",sep='|',index=False)
Technically, I don't think it's possible.
However, you can cheat/fake it by making a one-column CSV like so:
import sys
import pandas as pd

out = (
    pd.read_csv("inputfile.csv", sep="|")
      .rename({"Date": "BDR2|Date"}, axis=1)
      .fillna("").astype(str)
      .pipe(lambda x: x.agg("|".join, axis=1)
                       .to_frame("|".join(x.columns)))
)

out.to_csv("outputfile.csv", index=False)
Output:
out.to_csv(sys.stdout, index=False)
BDR2|Date|ID|CD|BIN|INTRNL|PCC|IND|CENTRE|TRANS|ENTITY
20221231|APPLE|10004050|BCH_dummy|3505|N|Y|Y|6310|
20221231|APPLE|10004050|BCH_MOTOR|3502|N|Y|Y|6310|
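Another way to get the same file, if you'd rather not collapse the frame into one column, is to write the custom header line yourself and then append the data rows without a header. A sketch, assuming the same input file and output path used above:
import pandas as pd

df = pd.read_csv("inputfile.csv", sep="|")

with open(r"C:\INPUT\df_sample_test.csv", "w", newline="") as f:
    # the extra name only exists in the header line
    f.write("BDR2|" + "|".join(df.columns) + "\n")
    # the data rows keep their original 9 delimiters
    df.to_csv(f, sep="|", index=False, header=False)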
I have a dataframe that comes from SharePoint (Microsoft), and it has a lot of JSON inside the cells as metadata. I don't usually work with JSON, so I'm struggling with it.
# df sample
+-------------+----------+
| Id | Event |
+-------------+----------+
| 105 | x |
+-------------+----------+
x = {"#odata.type":"#Microsoft.Azure.Connectors.SharePoint.SPListExpandedReference","Id":1,"Value":"Digital Training"}
How do I assign just the value "Digital Training" to the cell, for example? Remember that this is occurring for a lot of columns, and I need to solve it for those too. Thanks.
If the Event column consists of dict objects:
df['Value'] = df.apply(lambda x: x['Event']['Value'], axis=1)
If the Event column holds JSON strings:
import json
df['Value'] = df.apply(lambda x: json.loads(x['Event'])['Value'], axis=1)
Both result in
Id Event Value
0 x {"#odata.type":"#Microsoft.Azure.Connectors.Sh... Digital Training
I am using a module called pyhaystack to retrieve data (REST API) from a building automation system based on 'tags'. Python will return a dictionary of the data. I'm trying to use pandas with an if/else statement further below that I am having trouble with. The pyhaystack part is working just fine to get the data...
This connects me to the automation system: (works just fine)
from pyhaystack.client.niagara import NiagaraHaystackSession
import pandas as pd
session = NiagaraHaystackSession(uri='http://0.0.0.0', username='Z', password='z', pint=True)
This code finds my tags called 'znt', converts dictionary to Pandas, and filters for time: (works just fine for the two points)
znt = session.find_entity(filter_expr='znt').result
znt = session.his_read_frame(znt, rng= '2018-01-01,2018-02-12').result
znt = pd.DataFrame.from_dict(znt)
znt.index.names=['Date']
znt = znt.fillna(method = 'ffill').fillna(method = 'bfill').between_time('08:00','17:00')
What I am most interested in is the column name, where ultimately I want Python to return the column name based on conditions:
print(znt.columns)
print(znt.values)
Returns:
Index(['C.Drivers.NiagaraNetwork.Adams_Friendship.points.A-Section.AV1.AV1ZN~2dT', 'C.Drivers.NiagaraNetwork.points.A-Section.AV2.AV2ZN~2dT'], dtype='object')
[[ 65.9087 66.1592]
[ 65.9079 66.1592]
[ 65.9079 66.1742]
...,
[ 69.6563 70.0198]
[ 69.6563 70.2873]
[ 69.5673 70.2873]]
I am most interested in this column name of the Pandas dataframe: C.Drivers.NiagaraNetwork.Adams_Friendship.points.A-Section.AV1.AV1ZN~2dT
For my two columns, I am subtracting the value of 70 from the data in the dataframe (works just fine):
znt_sp = 70
deviation = znt - znt_sp
deviation = deviation.abs()
deviation
And this is where I am getting tripped up in Pandas. I want Python to print the name of the column if the deviation is greater than four, else print that the zone is Normal. Any tips would be greatly appreciated.
if (deviation > 4).any():
print('Zone %f does not make setpoint' % deviation)
else:
print('Zone %f is Normal' % deviation)
The column names in Pandas look like:
C.Drivers.NiagaraNetwork.Adams_Friendship.points.A-Section.AV1.AV1ZN~2dT
I think a DataFrame would be a good way to handle what you want.
Starting with znt, you can do all the calculations there:
deviation = znt - 70
deviation = deviation.abs()
# and the cool part is filtering in the df
problem_zones = deviation[deviation['C.Drivers.NiagaraNetwork.Adams_Friendship.points.A-Section.AV1.AV1ZN~2dT'] > 4]
You can play with this and figure out a way to iterate through columns, like:
for each in df.columns:
    # if in this column, more than 10 occurrences of deviation GT 4...
    if len(df[df[each] > 4]) > 10:
        print('This zone has a lot of trouble:', each)
edit
I like adding columns to a DataFrame instead of just building an external Series.
df['error_for_a'] = df['a'] - 70
This opens possibilities and keeps everything together. One could use
df[df['error_for_a'] > 4]
Again, all() or any() can be useful, but in a real-life scenario we would probably need to trigger the "fault detection" only when a certain number of errors are present.
If the schedule has been set to 'occupied' at 8 AM, maybe the first entries won't be correct (any() would trigger an error even if the situation gets better 30 minutes later). Another scenario would be a conference room where the error is tiny... but as soon as there are people in it, things go bad (all() would not see that).
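A small sketch of that idea, assuming deviation is the absolute-error frame built above and picking an arbitrary minimum number of occurrences:
min_occurrences = 10                          # arbitrary threshold, tune to your data
fault_counts = (deviation > 4).sum()          # samples more than 4 degrees off, per zone column
faulty_zones = fault_counts[fault_counts >= min_occurrences].index.tolist()
print('Zones with repeated trouble:', faulty_zones)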
Solution:
You can iterate over columns
for col in df.columns:
if (df[col] > 4).any(): # or .all if needed
print('Zone %s does not make setpoint' % col)
else:
print('Zone %s is Normal' % col)
Or by defining a function and using apply
def _print(x):
if (x > 4).any():
print('Zone %s does not make setpoint' % x.name)
else:
print('Zone %s is Normal' % x.name)
df.apply(lambda x: _print(x))
# you can even do
[_print(df[col]) for col in df.columns]
Advice:
Maybe you would rather keep the result in another structure; change the function to return a boolean Series that tells whether a column "is normal":
def is_normal(x):
return not (x > 4).any()
s = df.apply(lambda x: is_normal(x))
# or directly
s = df.apply(lambda x: not (x > 4).any())
It will return a Series s whose index is the column names of your df and whose values are booleans corresponding to your condition.
You can then use it to get all the Normal column names with s[s].index, or the non-normal ones with s[~s].index.
Ex: I want only the normal columns of my df: df[s[s].index]
A complete example
For the example I will use a sample df and a slightly different condition from yours (a zone is Normal if no element is 4 or greater, else it does not make the setpoint).
df = pd.DataFrame(dict(a=[1,2,3],b=[2,3,4],c=[3,4,5])) # A sample
print(df)
a b c
0 1 2 3
1 2 3 4
2 3 4 5
Your use case: Print if normal or not - Solution
for col in df.columns:
    if (df[col] >= 4).any():
        print('Zone %s does not make setpoint' % col)
    else:
        print('Zone %s is Normal' % col)
Result
Zone a is Normal
Zone b does not make setpoint
Zone c does not make setpoint
To illustrate my Advice: keep the is_normal result in a Series
s = df.apply(lambda x: not (x >= 4).any())  # Build the series
print(s)
a True
b False
c False
dtype: bool
print(df[s[~s].index])  # False-columns df
b c
0 2 3
1 3 4
2 4 5
print(df[s[s].index])  # True-columns df
a
0 1
1 2
2 3
I am working on analyzing customer return behavior and am working with the following dataframe df:
Customer_ID | Order | Store_ID | Date | Item_ID | Count_of_units | Event_Return_Flag
ABC 123 1 23052016 A -1 Y
ABC 345 1 23052016 B 1 0
ABC 567 1 24052016 C -1 0
I need to add another column to flag customers who returned something during the event (Event_Return_Flag = Y) and bought something on the same day in the same store.
In other words, I want to add a flag df['target'] with the following if logic:
same Customer_ID, Store_ID, Date as a record with Event_Return_Flag = Y
but a different Item_ID than the record with Event_Return_Flag = Y
Count_of_units > 0
I don't know how to accomplish this in python pandas.
I was thinking of creating a key by concatenating Customer_ID, Store_ID and Date; then splitting the file by Event_Return_Flag and using an isin statement, something like this:
df['key'] = df['Customer_ID'] + '_' + df['Store_ID'] + '_' + df['Date'].apply(str)
df_1 = df.loc[df['Event_Return_Flag'] == 'Y']
df_2 = df.loc[df['Event_Return_Flag'] == '0']
df_3 = df_2.loc[df_2['Count_of_units'] > 0]
df_3['target'] = np.where(df_3['key'].isin(df_1['key']), 'Y', 0)
This approach seems quite wrong, but I couldn't come up with something better. I get this error message for the last line with np.where:
C:\Users\xxx\AppData\Local\Continuum\Anaconda2\lib\site-packages\ipykernel\__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
if __name__ == '__main__':
I tried something along this line, but couldn't figure out how to match rows based on the Event_Return_Flag column:
df['target'] = (np.where((df.Item_Units_S > 0)&(df.groupby(['key','Item_ID']).Event_Return_flag.transform('nunique') > 1), 'Y', ''))
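For reference, a sketch of one way to express the keyed approach described above, avoiding the chained assignment that triggers the SettingWithCopyWarning (column names are the ones from the sample):
import numpy as np

df['key'] = (df['Customer_ID'].astype(str) + '_'
             + df['Store_ID'].astype(str) + '_'
             + df['Date'].astype(str))

returns = df[df['Event_Return_Flag'] == 'Y']
return_keys = set(returns['key'])
returned_items = set(zip(returns['key'], returns['Item_ID']))

df['target'] = np.where(
    df['key'].isin(return_keys)                                                 # same customer / store / date as a return
    & ~df.apply(lambda r: (r['key'], r['Item_ID']) in returned_items, axis=1)   # a different item than the returned one
    & (df['Count_of_units'] > 0),                                               # an actual purchase
    'Y', 0)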
I have a dataframe from which I want to know the highest value for each column. But I also want to know in what row it happened.
With my code I have to type the name of each column each time. Is there a better way to get the highest value of every column at once?
df2.loc[df2['ALL'].idxmax()]
(screenshot: THE DATAFRAME)
(screenshot: WHAT I GET WITH MY CODE)
(screenshot: WHAT I WANT)
You can stack your frame and then sort the values from largest to smallest and then take the first occurrence of your column names.
First I will create some fake data
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'),
                  index=list('nopqrstuvw'))
df.columns.name = 'level_0'
df.index.name = 'level_1'
Output
level_0 a b c d e
level_1
n 0.417317 0.821350 0.443729 0.167315 0.281859
o 0.166944 0.223317 0.418765 0.226544 0.508055
p 0.881260 0.789210 0.289563 0.369656 0.610923
q 0.893197 0.494227 0.677377 0.065087 0.228854
r 0.394382 0.573298 0.875070 0.505148 0.334238
s 0.046179 0.039642 0.930811 0.326114 0.880804
t 0.143488 0.561449 0.832186 0.486752 0.323215
u 0.891823 0.616401 0.247078 0.497050 0.995108
v 0.888553 0.386260 0.816100 0.874761 0.769073
w 0.557239 0.601758 0.932839 0.274614 0.854063
Now stack, sort and drop all but the first column occurrence
df.stack()\
.sort_values(ascending=False)\
.reset_index()\
.drop_duplicates('level_0')\
.sort_values('level_0')[['level_0', 0, 'level_1']]
level_0 0 level_1
3 a 0.893197 q
12 b 0.821350 n
1 c 0.932839 w
9 d 0.874761 v
0 e 0.995108 u
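As a shorter alternative (a sketch on the same fake df), max and idxmax can be combined directly, since each returns one entry per column:
summary = pd.DataFrame({'max_value': df.max(), 'row': df.idxmax()})
print(summary)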