Most of the answers given to questions that looked relevant weren't actually helpful to the people asking, and the people answering didn't seem to figure things out after learning that what they had contributed didn't work. I have tried pretty much every str() and .to_string() variation I could find.
Anyway, I've been trying to pair up the data in a file and omit the data I can't pair up. I believe I've paired things together, but I have no way to verify this other than seeing the column name and True or False.
import pandas as pd
# read file
with open('TastyTrades.csv', 'r') as trade_history:
    trade_reader = pd.read_csv(trade_history)
# sort out for options only
options_frame = trade_reader.loc[(trade_reader['Instrument Type'] == 'Equity Option')]
# resort data
date_frame = options_frame.sort_values(by=['Symbol', 'Date', 'Action'], ascending=True)
# pair BTO to STC
BTO = date_frame['Action'].isin(['BUY_TO_OPEN', 'SELL_TO_CLOSE'])
STO = date_frame['Action'].isin(['SELL_TO_OPEN', 'BUY_TO_CLOSE'])
# bringing both frames as one
pairs = [BTO, STO]
# return readable data
result = pd.concat(pairs).astype(str)
# write to new file
result.to_csv('new_taste.csv')
This code gives me:
,Action
101,True
75,True
102,False
76,False
95,False
97,True
98,True
38,True
174,True
166,True
I am trying to get the data back to the readable format:
Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put
2020-02-14T15:49:12-0500,Trade,SELL_TO_OPEN,TGT 200327C00127000,Equity Option,Sold 1 TGT 03/27/20 Call 127.00 # 1.33,133,1,133,-1,-0.15,100,TGT,3/27/2020,127,CALL
2020-02-14T15:49:11-0500,Trade,SELL_TO_OPEN,TGT 200327P00107000,Equity Option,Sold 1 TGT 03/27/20 Put 107.00 # 1.80,180,1,180,-1,-0.15,100,TGT,3/27/2020,107,PUT
2020-02-14T15:49:11-0500,Trade,BUY_TO_OPEN,TGT 200327C00128000,Equity Option,Bought 1 TGT 03/27/20 Call 128.00 # 1.17,-117,1,-117,-1,-0.14,100,TGT,3/27/2020,128,CALL
Here, BTO and STO will only contain the result of the isin condition (True or False).
So, rewrite those two lines as below:
BTO = date_frame[date_frame['Action'].isin(['BUY_TO_OPEN', 'SELL_TO_CLOSE'])]
STO = date_frame[date_frame['Action'].isin(['SELL_TO_OPEN', 'BUY_TO_CLOSE'])]
This will keep all the columns in BTO and STO, and then you can merge these two DataFrames.
Hope this helps.
Check the working code below:
I tried it and got the expected result (all the rows) with the code above. I also tried it without converting to 'str', and that gave the same result. Try printing the result and see what it shows.
BTO = quotes[quotes['Action'].isin(['BTO', 'STC'])]
STO = quotes[quotes['Action'].isin(['STO', 'BTC'])]
frames = [BTO,STO]
result = pd.concat(frames).astype(str)
result.to_csv('new_taste.csv')
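For reference, here is a minimal sketch of the whole pipeline with that fix applied. It assumes the column names shown in the sample CSV above ('Instrument Type', 'Symbol', 'Date', 'Action'); adjust them if your file differs.

import pandas as pd

# read the trade history
trade_reader = pd.read_csv('TastyTrades.csv')

# keep options only
options_frame = trade_reader[trade_reader['Instrument Type'] == 'Equity Option']

# sort so opens and closes of the same contract sit next to each other
date_frame = options_frame.sort_values(by=['Symbol', 'Date', 'Action'], ascending=True)

# boolean indexing keeps every column, not just the True/False mask
BTO = date_frame[date_frame['Action'].isin(['BUY_TO_OPEN', 'SELL_TO_CLOSE'])]
STO = date_frame[date_frame['Action'].isin(['SELL_TO_OPEN', 'BUY_TO_CLOSE'])]

# stack the two slices back into one readable frame and write it out
result = pd.concat([BTO, STO])
result.to_csv('new_taste.csv', index=False)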
Related
I'm working on a dataframe taken from Adafruit IO, and sadly some of my data is from a time when my project malfunctioned, so some of the values just equal NaN.
I tried to remove them with these lines of code:
onlyValidData=temp_data.mask(temp_data['value'] =='NaN')
onlyValidData
This is data retrieved from an Adafruit IO feed, being analyzed with pandas. I tried using the 'where' function too, but it didn't work.
My entire code is:
import pandas as pd
temp_data = pd.read_json('https://io.adafruit.com/api/(...)')
light_data = pd.read_json('https://io.adafruit.com/api/(...)')
temp_data['created_at'] = pd.to_datetime(temp_data['created_at'], infer_datetime_format=True)
temp_data = temp_data.set_index('created_at')
light_data['created_at'] = pd.to_datetime(light_data['created_at'], infer_datetime_format=True)
light_data = light_data.set_index('created_at')
tempVals = pd.Series(temp_data['value'])
lightVals = pd.Series(light_data['value'])
onlyValidData=temp_data.mask(temp_data['value'] =='NaN')
onlyValidData
The output is all of my data for some reason, but it should be only the valid values.
Hey, I think the issue here is that you're looking for values equal to the string 'NaN', while actual NaN values aren't strings, or more specifically aren't anything.
Try using:
onlyValidData = temp_data.mask(temp_data['value'].isnull())
Edit: to remove rows rather than marking all values in that row as NaN:
onlyValidData = temp_data.dropna()
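To see the difference on a tiny made-up frame (the 'value' column name matches the question; the readings and timestamps are invented):

import numpy as np
import pandas as pd

# hypothetical stand-in for the Adafruit feed
temp_data = pd.DataFrame(
    {'value': [21.5, np.nan, 22.1, np.nan]},
    index=pd.date_range('2020-01-01', periods=4, freq='min'))

# mask() keeps the frame's shape and turns the flagged rows into NaN;
# dropna() removes those rows entirely
only_valid = temp_data.dropna(subset=['value'])
print(only_valid)  # only the two rows with real readings remain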
I am reading data from multiple dataframes.
Since the indexing and inputs are different, I need to repeat the pairing and analysis, and I need dataframe-specific outputs. This pushes me to copy-paste and repeat the code.
Is there a fast way to refer to multiple dataframes to do the same analysis?
DF1= pd.read_csv('DF1 Price.csv')
DF2= pd.read_csv('DF2 Price.csv')
DF3= pd.read_csv('DF3 Price.csv') # These CSV's contain main prices
DF1['ParentPrice'] = FamPrices ['Price1'] # These CSV's contain second prices
DF2['ParentPrice'] = FamPrices ['Price2']
DF3['ParentPrice'] = FamPrices ['Price3']
DF1['Difference'] = DF1['ParentPrice'] - DF1['Price'] # Price difference is the output
DF2['Difference'] = DF2['ParentPrice'] - DF2['Price']
DF3['Difference'] = DF3['ParentPrice'] - DF3['Price']
It is possible to parametrize strings using f-strings, available in Python >= 3.6. In an f-string you can insert the string representation of a variable's value inside the string, as in:
>>> a = 3
>>> s = f"{a} is larger than 1"
>>> print(s)
3 is larger than 1
Your code would become:
list_of_DF = []
for symbol in ["1", "2", "3"]:
    df = pd.read_csv(f"DF{symbol} Price.csv")
    df['ParentPrice'] = FamPrices[f'Price{symbol}']
    df['Difference'] = df['ParentPrice'] - df['Price']
    list_of_DF.append(df)
Then DF1 would be list_of_DF[0], and so on.
As mentioned, this answer is only valid if you are using Python 3.6 or later.
For the third part, I'd suggest creating something like:
DFS = [DF1, DF2, DF3]

def create_difference(dataframe):
    dataframe['Difference'] = dataframe['ParentPrice'] - dataframe['Price']

for dataframe in DFS:
    create_difference(dataframe)
For the second part there is no really convenient and short way that I can think of, except maybe:
for i in range(len(DFS)):
    DFS[i]['ParentPrice'] = FamPrices[f'Price{i+1}']
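Putting the two ideas together, a combined sketch could look like the following. FamPrices is assumed to already exist, as in the question, with columns 'Price1' through 'Price3', and each CSV is assumed to have a 'Price' column.

import pandas as pd

dataframes = {}
for i in ("1", "2", "3"):
    df = pd.read_csv(f"DF{i} Price.csv")
    df['ParentPrice'] = FamPrices[f'Price{i}']          # assumed second-price frame
    df['Difference'] = df['ParentPrice'] - df['Price']
    dataframes[f"DF{i}"] = df

# dataframes["DF1"] now corresponds to the old DF1, and so on.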
I have a huge set of data, something like 100k lines, and I am trying to drop a row from a dataframe if the row, which contains a list, contains a value from another dataframe. Here's a small example.
has = [['#a'], ['#b'], ['#c, #d, #e, #f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
tweet user
0 [#a] 1
1 [#b] 2
2 [#c, #d, #e, #f] 3
3 [#g] 5
z
0 #d
1 #a
The desired outcome would be
tweet user
0 [#b] 2
1 [#g] 5
Things I've tried:
#this seems to work for dropping #a but not #d
for a in range(df.tweet.size):
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a)
#this works for my small scale example but throws an error on my big data
df['tweet'] = df.tweet.apply(', '.join)
test = df[~df.tweet.str.contains('|'.join(df2['z'].astype(str)))]
#the error being "unterminated character set at position 1343770"
#i went to check what was on that line and it returned this
basket.iloc[1343770]
user_id 17060480
tweet [#IfTheyWereBlackOrBrownPeople, #WTF]
Name: 4612505, dtype: object
Any help would be greatly appreciated.
Is ['#c, #d, #e, #f'] one string, or a list like this: ['#c', '#d', '#e', '#f']?
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
A simple solution would be:
screen = set(df2.z.tolist())
to_delete = list()  # this will speed things up, doing only 1 delete
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
Speed comparison (for 10,000 rows):
st = time.time()
screen = set(df2.z.tolist())
to_delete = list()
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
print(time.time() - st)
2.142000198364258
st = time.time()
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break
print(time.time() - st)
43.99799990653992
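If you prefer not to collect indices at all, an equivalent (untimed) sketch builds a boolean mask with apply() over the same set intersection; it assumes the tweet column already holds real lists of tags:

# build a boolean mask instead of collecting indices to drop
screen = set(df2.z)
keep = ~df.tweet.apply(lambda tags: bool(screen.intersection(tags)))
result = df[keep].reset_index(drop=True)
print(result)  # should show only the rows for users 2 and 5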
For me, your code works if I make several adjustments.
First, you're missing the last row when using range(df.tweet.size); either increase this or (more robust, if you don't have an increasing index) use df.tweet.index.
Second, you don't apply your dropping; use inplace=True for that.
Third, you have #d inside a single string: '#c, #d, #e, #f' is not a list, and you have to change it to an actual list so this works.
So if you change that, the following code works fine:
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break  # if we already dropped this row, we no longer check whether we should drop it
This will provide the desired result. Be aware that this is potentially not optimal due to the missing vectorization.
EDIT:
you can turn the string into a list with the following:
from itertools import chain
df.tweet = df.tweet.apply(lambda l: [s.strip() for s in chain(*map(lambda lelem: lelem.split(","), l))])
This applies a function to each row (assuming each row contains a list with one or more elements): split each element (which should be a string) on commas into a new list, strip the surrounding whitespace, and "flatten" all the lists in that row (if there are multiple) together.
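As a quick, self-contained check of that conversion on the sample data from the question:

from itertools import chain
import pandas as pd

df = pd.DataFrame({'user': [1, 2, 3, 5],
                   'tweet': [['#a'], ['#b'], ['#c, #d, #e, #f'], ['#g']]})

# split each string on commas, strip whitespace, and flatten per row
df.tweet = df.tweet.apply(
    lambda l: [s.strip() for s in chain(*map(lambda lelem: lelem.split(","), l))])

print(df.tweet.tolist())
# [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]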
EDIT2:
Yes, this is not really performant, but it basically does what was asked. Keep that in mind and, once it is working, try to improve your code (fewer for iterations, tricks like collecting the indices and then dropping them all at once).
The type of data we are streaming in is taken from our PI System, which outputs data in an irregular manner. This is not uncommon with time series data, so I have attempted to add 1 second or so to each time stamp to ensure the index is unique. However, this has not worked as I hoped, as I keep receiving a TypeError.
I have attempted to implement the solutions highlighted in (Modifying timestamps in pandas to make index unique), however without any success.
The error message I get is:
TypeError: ufunc add cannot use operands with types dtype('O') and dtype('<m8')
The code implementation is below:
values = Slugging_Sep.index.duplicated(keep=False).astype(float)
values[values==0] = np.NaN
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff
# print result
result = Slugging_Sep.index + np.cumsum(values).astype(np.timedelta64)
print(result)
What I have tried:
Type casting - I thought the error was due to two different types being added together, but this hasn't resolved the issue.
Using Timedelta in pandas - this creates the same TypeError:
Slugging_Sep['Time'] = (str(Slugging_Sep['Time'] +
    pd.to_timedelta(Slugging_Sep.groupby('Time').cumcount(), unit='ms')))
So I have two questions from this:
Could anyone provide some advice on how to solve this for future time series issues?
What actually is dtype('<m8')?
Thank you.
Using Alex Zisman's suggestion, I reconverted the Slugging_Sep index via the following lines:
Slugging_Sep['Time'] = pd.to_datetime(Slugging_Sep['Time'])
Slugging_Sep.set_index('Time', inplace=True)
I then implemented the following code taken from the above SO link I mentioned:
values = Slugging_Sep.index.duplicated(keep=False).astype(float)
values[values == 0] = np.NaN
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff
# apply the cumulative offsets to the index
result = Slugging_Sep.index + np.cumsum(values).astype(np.timedelta64)
Slugging_Sep.index = result
print(Slugging_Sep.index)
This resolved the issue and added nanoseconds to each duplicate time stamp so it became a unique index.
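For a tiny, self-contained illustration of the same idea (the readings here are invented; this uses the groupby/cumcount variant from the question, which also works once the index is a real DatetimeIndex rather than strings):

import pandas as pd

# invented readings with a duplicated timestamp
idx = pd.to_datetime(['2020-01-01 00:00:00',
                      '2020-01-01 00:00:00',
                      '2020-01-01 00:00:05'])
df = pd.DataFrame({'flow': [1.0, 2.0, 3.0]}, index=idx)

# offset the n-th duplicate of each timestamp by n milliseconds
offsets = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='ms')
df.index = df.index + offsets.values

print(df.index.is_unique)  # True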
Ok, I know the title may be a little bit confusing, but I will try to explain this in detail:
I use Python 3.5.2:
I got two .csv files that I read via pandas and convert into two separate dataframes. The first dataframe (coming from XYZ.csv) looks like this:
ip community
10.0.0.1 OL123
.
.
.
123.12.5.31 IK753
The second (export.csv) just has the "ip" column.
Now what I want to do:
I want to compare the two dataframes and, as a result, get a third dataframe (or list) that contains all IP addresses that are in the first dataframe but not in the other, WITH their corresponding community. So far I have managed to compare the two and get a proper result, as long as the second dataframe also contains the communities. I manually inserted those communities into the second export.csv; unfortunately I cannot automate this, and that is why I need this to work without the second dataframe containing the communities.
This is my code:
def compare_csvs():
    timestamp = time.strftime("%Y-%m-%d")

    # Reads XYZ.csv and creates a list that contains all ip addresses in integer format.
    A = pd.read_csv("XYZ.csv", index_col=False, header=0)
    ips1 = A.ip.tolist()
    comu1 = A.ro_community.tolist()
    AIP = []
    for element1 in ips1:
        AIP.append(int(ipaddress.IPv4Address(element1)))
    IPACOM1 = zip(AIP, comu1)

    # Reads export.csv and creates a list that contains all ip addresses in integer format.
    B = pd.read_csv("export" + timestamp + ".csv", index_col=False, header=0)
    ips2 = B.ip.tolist()
    comu2 = B.ro_community.tolist()
    BIP = []
    for element2 in ips2:
        BIP.append(int(ipaddress.IPv4Address(element2)))
    IPACOM2 = zip(BIP, comu2)

    # Creates a set that contains all ip addresses (in integer format) that exist inside XYZ.csv but not export.csv.
    DeltaInt = OrderedSet(IPACOM1) - OrderedSet(IPACOM2)
    List = list(DeltaInt)
    UnzippedIP = []
    UnzippedCommunity = []
    UnzippedIP, UnzippedCommunity = zip(*List)

    # Puts all the elements of the DeltaInt set inside a list and also changes the integers back to readable IPv4 addresses.
    DeltaIP = []
    for element3 in UnzippedIP:
        DeltaIP.append(str(ipaddress.IPv4Address(element3)))
    IPandCommunity = zip(DeltaIP, UnzippedCommunity)
Now all I need is something that can compare the two lists I created and keep the "community" with the "ip" it is assigned to. I tried a whole lot but I can't seem to get anything to work. Maybe I am just having a problem with the logic here, all help is appreciated!
Also, excuse the code mess, I just threw all that together and will clean it up once the code actually works.
Here is some dummy data to play with:
This is df:
ip community
10.0.0.1 OL123
10.1.1.1 ACLSH
10.9.8.7 OKUAJ1
123.12.5.31 IK753
df = pd.read_clipboard()
This is export.csv:
s_export = pd.Series(name='ip', data=['10.1.1.1', '123.12.5.31', '0.0.0.0'])
s_export
0 10.1.1.1
1 123.12.5.31
2 0.0.0.0
Name: ip, dtype: object
To select the ones that aren't in export, we can simply use boolean indexing using isin():
# ~ means 'not', so here that's "find df.ip that is NOT in s_export"
# Store result in a dataframe
df_exclude = df[~df.ip.isin(s_export)]
df_exclude
ip community
0 10.0.0.1 OL123
2 10.9.8.7 OKUAJ1
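Applied back to the files from the question, a hedged sketch might look like this. It assumes XYZ.csv carries the ip and community columns and the export file carries at least an ip column; the output filename is just an example.

import time
import pandas as pd

timestamp = time.strftime("%Y-%m-%d")
df = pd.read_csv("XYZ.csv")                            # has ip + community columns
exported = pd.read_csv("export" + timestamp + ".csv")  # has an ip column

# rows of XYZ.csv whose ip does not appear in the export file,
# with their community still attached
delta = df[~df.ip.isin(exported.ip)]
delta.to_csv("delta_" + timestamp + ".csv", index=False)  # example output name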