How to find the difference between dates in a pandas dataframe in Azure ML - Python

Does Azure use some other syntax for finding the difference between dates and times, or is a package missing in Azure? How do I find the difference between dates in a pandas dataframe in Azure ML?
I have two columns in a dataframe and need to store their difference in a third column. The problem is that all of this runs fine in my Python IDE but not in Microsoft Azure.
My date format: 2015-09-25T01:45:34.372Z
I need to compute df['days'] = df['a'] - df['b']
I have tried almost all the syntax available on Stack Overflow.
Please help.
import pandas as pd

mylist = ['app_lastCommunicatedAt', 'app_installAt', 'installationId']

def finding_dates(df, mylist):
    for i in mylist:
        if i == 'installationId':
            continue
        # Parse the ISO 8601 strings (e.g. 2015-09-25T01:45:34.372Z) to datetimes
        df[i] = pd.to_datetime(df[i])
    df['days'] = abs((df[mylist[1]] - df[mylist[0]]).dt.days)
    return df
When I call this function it raises an error and does not accept the lines below continue.
I have also tried many other things, like converting the dates to strings.

Per my experience, it seems that the issue was caused by your code missing the dataframe_service decorator, which indicates that the function operates on a data frame; please see https://github.com/Azure/Azure-MachineLearning-ClientLibrary-Python#dataframe_service. If you are not familiar with decorators (the @ syntax), please see https://www.python.org/dev/peps/pep-0318/.
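For illustration only, a rough sketch of how the decorated function might look; the decorator stacking, argument names, and credentials below are assumptions based on the linked README, so check it for the exact signature:

import pandas as pd
from azureml import services

# Sketch only -- decorator order/arguments are assumptions; see the README
@services.publish('workspace_id', 'authorization_token')  # placeholder credentials
@services.types(a=str, b=str)
@services.returns(int)
@services.dataframe_service
def days_between(df):
    # df arrives as a pandas DataFrame with columns 'a' and 'b'
    return abs((pd.to_datetime(df['a']) - pd.to_datetime(df['b'])).dt.days)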

Related

Applying corrections to a subsampled copy of a dataframe back to the original dataframe?

I'm a Pandas newbie, so please bear with me.
Overview: I started with a free-form text file created by a data harvesting script that remotely accessed dozens of different kinds of devices, and multiple instances of each. I used OpenRefine (a truly wonderful tool) to munge that into a CSV that was then input to dataframe df using Pandas in a JupyterLab notebook.
My first inspection of the data showed the 'Timestamp' column was not monotonic. I accessed individual data sources as follows, in this case for the 'T-meter' data source. (The technique was taken from a search result - I don't really understand it, but it worked.)
cond = df['Source']=='T-meter'
rows = df.loc[cond, :]
df_tmeter = pd.DataFrame(columns=df.columns)
df_tmeter = df_tmeter.append(rows, ignore_index=True)
then checked each as follows:
df_tmeter['Timestamp'].is_monotonic
Fortunately, the problem was easy to identify and fix: Some sensors were resetting, then sending bad (but still monotonic) timestamps until their clocks were updated. I wrote the function healing() to cleanly patch such errors, and it worked a treat:
df_tmeter['healed'] = df_tmeter['Timestamp'].apply(healing)
Now for my questions:
How do I get the 'healed' values back into the original df['Timestamp'] column for only the 'T-meter' items in df['Source']?
Given the function healing(), is there a clean way to do this directly on df?
Thanks!
Edit: I first thought I should be using 'views' into df, but other operations on the data would either generate errors, or silently turn the views into copies.
I wrote a wrapper function heal_row() for healing():
def heal_row(row):
    if row['Source'] == 'T-meter':  # Redundant check, but safe!
        row['Timestamp'] = healing(row['Timestamp'])
    return row
then did the following:
df = df.apply(lambda row: row if row['Source'] != 'T-meter' else heal_row(row), axis=1)
This ordering is important, since healing() is stateful based on the prior row(s), and thus can't be the default operation.
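For question 1 specifically, a boolean mask can also write the healed values straight back without the row-wise wrapper; a minimal sketch, assuming healing() only needs to see the 'T-meter' rows in their original order:

# Apply healing() to the 'T-meter' rows only and assign the results in place
cond = df['Source'] == 'T-meter'
df.loc[cond, 'Timestamp'] = df.loc[cond, 'Timestamp'].apply(healing)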

How to use parse from phonenumbers Python library on a pandas data frame?

How can I parse phone numbers from a pandas data frame, ideally using phonenumbers library?
I am trying to use a port of Google's libphonenumber library for Python:
https://pypi.org/project/phonenumbers/.
I have a data frame with 3 million phone numbers from many countries. I have a column with the phone number and a column with the country/region code. I'm trying to use the parse function in the package. My goal is to parse each row using the corresponding country code, but I can't find a way of doing it efficiently.
I tried using apply but it didn't work. I get a "(0) Missing or invalid default region." error, meaning it won't pass the country code string.
df['phone_number_clean'] = df.phone_number.apply(lambda x:
    phonenumbers.parse(str(df.phone_number), str(df.region_code)))
The line below works, but doesn't get me what I want, as the numbers I have come from about 120+ different countries.
df['phone_number_clean'] = df.phone_number.apply(lambda x:
    phonenumbers.parse(str(df.phone_number), "US"))
I tried doing this in a loop, but it is terribly slow. Took me more than an hour to parse 10,000 numbers, and I have about 300x that:
for i in range(n):
    df3['phone_number_std'][i] = phonenumbers.parse(str(df.phone_number[i]),
                                                    str(df.region_code[i]))
Is there a method I'm missing that could run this faster? The apply function works acceptably well but I'm unable to pass the data frame element into it.
I'm still a beginner in Python, so perhaps this has an easy solution. But I would greatly appreciate your help.
Your initial solution using apply is actually pretty close - you don't say what doesn't work about it, but the syntax for a lambda function over multiple columns of a dataframe, rather than over the values within a single column, is a bit different. Try this:
df['phone_number_clean'] = df.apply(lambda x:
    phonenumbers.parse(str(x.phone_number),
                       str(x.region_code)),
    axis='columns')
The differences:
You want to include multiple columns in your lambda function, so you want to apply your lambda function to the entire dataframe (i.e., df.apply) rather than to the Series (the single column) that is returned by doing df.phone_number.apply. (Print the output of df.phone_number to the console - what is returned is all the information that your lambda function will be given.)
The argument axis='columns' (or axis=1, which is equivalent; see the docs) actually slices the data frame by rows, so apply 'sees' one record at a time (i.e., [index0, phonenumber0, countrycode0], [index1, phonenumber1, countrycode1], ...) as opposed to slicing in the other direction, which would give it [phonenumber0, phonenumber1, phonenumber2, ...].
Your lambda function only knows about the placeholder x, which, in this case, is the Series [index0, phonenumber0, countrycode0], so you need to specify all the values relative to the x that it knows - i.e., x.phone_number, x.region_code.
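To see this concretely, here is a tiny sketch with made-up values (hypothetical data, just to show what the lambda receives):

import pandas as pd

df = pd.DataFrame({'phone_number': ['2025550123', '2025550124'],
                   'region_code': ['US', 'GB']})

# axis='columns' hands the lambda one row at a time as a Series,
# so both fields are available as attributes of x
pairs = df.apply(lambda x: (x.phone_number, x.region_code), axis='columns')
print(pairs.tolist())  # [('2025550123', 'US'), ('2025550124', 'GB')]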
Love the solution of @katelie, but here's my code. I added a try/except block to stop the format_number function from failing; it cannot handle strings that are too long.
import phonenumbers as phon

def formatE164(number):
    try:
        return phon.format_number(phon.parse(str(number), "NL"),
                                  phon.PhoneNumberFormat.E164)
    except Exception:
        # Skip values phonenumbers cannot parse (e.g. overly long strings)
        return None

df['column'] = df['column'].apply(formatE164)

Pandas group events by year

I am very new to pandas but making progress...
I have the following dataframe:
I want to do a count of the number of events that have happened by Month/Year, which I believe would produce something like the below.
I have tried the following based on the article located here
group = df.groupby(['MonthYear', 'EventID']).count()
frequency = group['EventID'].groupby(level=0, group_keys=False)
print(frequency)
I then get an error (using VS Code) that states:
unable to open 'hashtable_class_helper.pxi'
I have had this before and it is usually when I have used the wrong case for my column names but I have verified they are correct.
Where am I going wrong?
You can use:
frequency = df.groupby('MonthYear')['EventID'].value_counts()
See the documentation for more details.
You could try aggregation on top of groupby:
df.groupby('MonthYear').agg({'EventID': 'count'})
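If what you want is simply the number of events in each MonthYear (rather than counts per distinct EventID), a minimal sketch:

# One row per MonthYear; the value is the number of events in that group
frequency = df.groupby('MonthYear').size()
print(frequency)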

Merge GoogleTrends Data Reports in Python

I'm quite new to Python and... well... let's say, not really an expert when it comes to coding, so apologies for the very amateurish question in advance. I'm trying to merge several Google Trends report.csv files to use for my research.
Two problems I encounter:
The report files aren't just a spreadsheet but contain lots of other, irrelevant information, i.e. I just want a certain block of each file to be merged (really just the daily data containing the dates and the corresponding SVI for each month; say, rows 6 to 30).
As the daily data will be extracted from monthly report files, and months do not have a constant number of days, I cannot just read a fixed range of rows but would need the range to match the number of days the specific month has.
Many thanks for the help!
Edit:
The code I use:
import pandas as pd
report = pd.read_csv('C:/Users/paul/Downloads/report.csv', skiprows=4, skipfooter=17)
print(report)
The output it produces
I managed to cut the first few lines off, but I don't know how to cut off the bottom bit from row 31 onwards, so skipfooter didn't seem to work. And I can't use nrows, as the months don't have the same number of days, so I won't know the number of rows in advance.
It turned out that it does help to occasionally read the warnings Python gives.
ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skip_footer; you can avoid this warning by specifying engine='python'.
The problem I had - that the skip_footer option didn't work - was apparently related to the C engine being used.
For anyone running into the same issue, here's the code I solved it with:
import pandas as pd
report = pd.read_csv('C:/Users/paul/Downloads/report.csv', skiprows=4, skip_footer=27, engine='python')
print(report)
Just add engine='python' to get rid of the C engine problem. Don't ask me why I had to skip 27 rows in the end (I was pretty sure I counted 17), but with a bit of trial and error it just worked.
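If you'd rather not count footer rows by hand at all, a hedged alternative is to find the cut-off programmatically; this sketch assumes the daily block in the report is terminated by a blank line, which may not hold for every report layout:

from io import StringIO
import pandas as pd

# Keep only the lines between the 4-row header and the first blank line,
# so the (variable-length) footer no longer has to be counted by hand
with open('C:/Users/paul/Downloads/report.csv') as f:
    lines = f.read().splitlines()

body = lines[4:]
end = next((i for i, line in enumerate(body) if not line.strip()), len(body))
report = pd.read_csv(StringIO('\n'.join(body[:end])))
print(report)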

Python Pandas - Main DataFrame, want to drop all columns in smaller DataFrame

I have a DataFrame ('main') that has about 300 columns. I created a smaller DataFrame ('public') and have been working on this.
I now want to delete the columns contained within 'public' from the larger DataFrame ('main').
I've tried the following instructions:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop.html
Python Pandas - Deleting multiple series from a data frame in one command
without any success, along with various other statements that have been unsuccessful.
The columns that make up 'public' are not consecutive - i.e. they are taken from various points in the larger DataFrame 'main'. All of the columns have the same Index. [Not sure if this is important, but 'public' was created using the 'join' function].
Yes, I'm being lazy - I don't want to have to type out the names of every column! I'm hoping there's a way to use the DataFrame 'public' in a statement that will allow deletion of these columns en masse. If anyone has any suggestions and/or guidance I'd be most grateful.
(Have Python 2.7 and am using Pandas, numpy, math, pylab etc.)
Thanks in advance.
Ignore my question - Murphy's Law prevails and I've just solved it.
I was using the statement from the Stack Overflow question mentioned above:
df.drop(df.columns[1:], axis=1)
and this was not working. I have instead used
df = df.drop(df2, axis=1)
and this worked (df = main, df2 = public). Simple, really, once you don't overthink it.
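For anyone on a newer pandas, an equivalent and slightly more explicit form of the same idea (this just names the columns to drop directly):

# Drop every column of 'public' from 'main' by name
main = main.drop(columns=public.columns)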
