I have a dataframe that has a small number of columns but many rows (about 900K right now, and it's going to get bigger as I collect more data). It looks like this:
| | Author | Title | Date | Category | Text | url |
|---|---|---|---|---|---|---|
| 0 | Amira Charfeddine | Wild Fadhila 01 | 2019-01-01 | novel | الكتاب هذا نهديه لكل تونسي حس إلي الكتاب يحكي ... | NaN |
| 1 | Amira Charfeddine | Wild Fadhila 02 | 2019-01-01 | novel | في التزغريت، والعياط و الزمامر، ليوم نتيجة الب... | NaN |
| 2 | 253826 | 1515368_7636953 | 2010-12-28 | /forums/forums/91/ | هذا ما ينص عليه إدوستور التونسي لا رئاسة مدى ا... | https://www.tunisia-sat.com/forums/threads/151... |
| 3 | 250442 | 1504416_7580403 | 2010-12-21 | /forums/sports/ | \n\n\n\n\n\nاعلنت الجامعة التونسية لكرة اليد ا... | https://www.tunisia-sat.com/forums/threads/150... |
| 4 | 312628 | 1504416_7580433 | 2010-12-21 | /forums/sports/ | quel est le résultat final\n,,,,???? | https://www.tunisia-sat.com/forums/threads/150... |
The "Text" column holds a string that may be just a few words (in the case of a forum post) or may be a portion of a novel with tens of thousands of words (as in the first two rows above).
I have code that constructs the dataframe from various corpus files (.txt and .json), then cleans the text and saves the cleaned dataframe as a pickle file.
I'm trying to run the following code to analyze how variable the spelling of different words is in the corpus. The functions seem simple enough: one counts the occurrences of a particular spelling variant in each Text row; the other takes a list of such frequencies and computes a Gini coefficient for each lemma (which is just a numerical measure of how heterogeneous the spelling is). It references a spelling_var dictionary that has a lemma as its key and the various ways of spelling that lemma as values (like {'color': ['color', 'colour']}, except not in English).
This code works, but it uses a lot of CPU time. I'm not sure how much, but I use PythonAnywhere for my coding and this code sends me into the tarpit (in other words, it makes me exceed my daily allowance of CPU seconds).
Is there a way to do this that is less CPU intensive? Preferably without my having to learn another package (I've spent the past several weeks learning Pandas and am liking it, and I need to get on with my analysis). Once I have the code and have finished collecting the corpus, I'll only run it a few times; I won't be running it every day or anything (in case that matters).
Here's the code:
import pickle
import pandas as pd
import re

with open('1_raw_df.pkl', 'rb') as pickle_file:
    df = pickle.load(pickle_file)

spelling_var = {
    'illi': ["الي", "اللي"],
    'besh': ["باش", "بش"],
    ...
}

spelling_df = df.copy()

def count_word(df, word):
    pattern = r"\b" + re.escape(word) + r"\b"
    return df['Text'].str.count(pattern)

def compute_gini(freq_list):
    proportions = [f/sum(freq_list) for f in freq_list]
    squared = [p**2 for p in proportions]
    return 1-sum(squared)

for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
        gini = compute_gini(count_list)
        spelling_df[w] = gini
I rewrote two lines in the last double loop; see the comments in the code below. Does this solve your issue?
gini_lst = []
for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
        #gini = compute_gini(count_list) # don't think you need to compute this at every iteration of the inner loop, right?
        #spelling_df[w] = gini # having this inside of the loop rewrites the column at each iteration, which wastes a lot of CPU time
    gini_lst.append(compute_gini(count_list))

# this creates a df with a row for each lemma with its associated gini value
df_lemma_gini = pd.DataFrame(data={"lemma_column": list(spelling_var.keys()), "gini_column": gini_lst})
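If a single value per lemma is the goal (rather than one value per row), one option that stays close to the code above is to total each variant over the whole corpus and compute the index from those totals. The following is only a rough sketch using the standard library: `variant_totals` and `gini_from_counts` are hypothetical names, and it assumes the texts can be iterated as plain strings (for example `df['Text']`), skipping anything that is not a string, such as NaN. It compiles one alternation pattern per lemma, so each text is scanned once per lemma instead of once per variant.

```python
import re
from collections import Counter


def variant_totals(texts, variants):
    """Total whole-word occurrences of each variant across all texts."""
    # One pattern per lemma: \b(?:variant1|variant2|...)\b
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
    )
    totals = Counter({v: 0 for v in variants})  # keep zero counts visible
    for text in texts:
        if isinstance(text, str):  # skip NaN / None entries
            totals.update(pattern.findall(text))
    return totals


def gini_from_counts(counts):
    """1 - sum(p_i ** 2), the same formula as compute_gini above."""
    total = sum(counts)
    if total == 0:
        return float("nan")  # lemma never occurs in the corpus
    return 1 - sum((c / total) ** 2 for c in counts)


texts = ["color colour color", "colour colorful", None]
totals = variant_totals(texts, ["color", "colour"])
print(totals["color"], totals["colour"])   # 2 2
print(gini_from_counts(totals.values()))   # 0.5
```

With the real data this might be called as `variant_totals(df['Text'], spelling_var['besh'])`, once per lemma.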
I am trying to find the difference between two time fields in a Pig relation. I can use Pig's todate() function, but it needs the values in hhmm format, and my values have no leading zeros. For example, if the two fields hold 1245 and 1425, I can convert them with todate and take the difference. If the values are 945 and 823, however, the conversion fails because the leading zero is missing.
So I wrote a Python UDF that attempts to left-pad a zero. The code is below:
@outputSchema("time:bytearray")
def zero(time):
    time = str(time)
    if len(time) <= 3:
        return '0' + time
    else:
        return time
Step 1 : Registered my python function
REGISTER '/home/Jig13517/zeropad.py' using jython AS myfuncs ;
Please find the relation below
Airlines_data_schema = LOAD '/user/Jig13517/pigsample/Airlines_data.csv' USING PigStorage('\t') AS (Year,Month,DayofMonth,DayofWeek,DepTime_actual,CRSDeptime,Arrtime_actual,CRSArrtime,UniqueCarrier,FlightNum,TailNum_Plane,ActualElapsedTime,CRSElapsedTime,Airtime,Arrdelay,Depdelay,Origin,Dest,Distance,Taxiin,Taxiout,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay);
=====================================
Then I tried to leftpad the column value with zeros
airlines_new = FOREACH Airlines_data_schema GENERATE Year,Month,DayofMonth,DayofWeek,myfuncs.zero($4) AS DepTime_actual_new,myfuncs.zero($5) AS CRSDeptime_new,myfuncs.zero($6) AS Arrtime_actual_new,myfuncs.zero($7) AS CRSArrtime_new,UniqueCarrier,FlightNum,TailNum_Plane,ActualElapsedTime,CRSElapsedTime,Airtime,Arrdelay,Depdelay,Origin,Dest,Distance,Taxiin,Taxiout,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay ;
===============================
Sample data after application of python udf
(2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA,,,,None,None,None,None,,,,,,,,,,,,,,,,,,,,,)
But as we can see above, the column values are not being converted; I am getting the same fields back unaltered. Please let me know what is wrong with my UDF, or whether there is a Pig method to achieve this task.
The str.zfill function could help
input.txt
1245
1425
945
823
pig_udfs.py
@outputSchema('time:chararray')
def lpad_time(time):
    return time.zfill(4)
time_formatter.pig
register pig_udfs.py using jython as myfuncs;
A = LOAD 'input.txt' USING PigStorage();
B = FOREACH A GENERATE myfuncs.lpad_time((chararray) $0);
\d B
Output
(1245)
(1425)
(0945)
(0823)
Obviously, you could make Python do the entire todate conversion itself...
Also, it wasn't clear from your question whether the minutes are zero-padded.
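For illustration, doing the whole calculation in Python might look roughly like the sketch below. The function names `hhmm_to_minutes` and `minutes_between` are hypothetical, and it assumes both times fall on the same day and that the minutes are zero-padded (so 905 means 09:05). It is written as plain Python and has not been wired up as a Pig UDF.

```python
def hhmm_to_minutes(value):
    """Convert an hhmm value such as 945 or '1425' to minutes since midnight."""
    text = str(value).strip().zfill(4)  # 945 -> '0945'
    if len(text) != 4 or not text.isdigit():
        raise ValueError("not an hhmm value: %r" % (value,))
    hours, minutes = int(text[:2]), int(text[2:])
    if minutes > 59:
        raise ValueError("minutes out of range: %r" % (value,))
    return hours * 60 + minutes


def minutes_between(start, end):
    """Difference in minutes between two hhmm values on the same day."""
    return hhmm_to_minutes(end) - hhmm_to_minutes(start)


print(minutes_between(823, 945))        # 82
print(minutes_between("1245", "1425"))  # 100
```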
EDIT
airlines.csv
2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA,,,,None,None,None,None,,,,,,,,,,,,,,,,,,,,,
pig code
register pig_udfs.py using jython as myfuncs;
A = LOAD 'airlines.csv' USING PigStorage(',');
B = FOREACH A GENERATE $0 AS Year, $1 AS Month, $2 AS DayofMonth, $3 AS DayofWeek,myfuncs.lpad_time((chararray) $4) AS DepTime_actual_new,myfuncs.lpad_time((chararray) $5) AS CRSDeptime_new,myfuncs.lpad_time((chararray) $6) AS Arrtime_actual_new,myfuncs.lpad_time((chararray) $7) AS CRSArrtime_new,$8 AS UniqueCarrier,$9 AS FlightNum,$10 AS TailNum_Plane,$11 AS ActualElapsedTime, $12 AS CRSElapsedTime, $13 AS Airtime, $14 AS Arrdelay, $15 AS Depdelay, $16 AS Origin, $17 AS Dest, $18 AS Distance, $19 AS Taxiin, $20 AS Taxiout, $21 AS Cancelled, $22 AS CancellationCode, $23 AS Diverted, $24 AS CarrierDelay, $25 AS WeatherDelay, $26 AS NASDelay, $27 AS SecurityDelay, $28 AS LateAircraftDelay ;
\d B
Output
(2008,1,3,4,0617,0615,0652,0650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA)
Hey @cricket_007, I got it working. I was passing the column fields as bytearray; that was my mistake. Once I changed the schema to chararray, the zero padding started working. Thanks a lot.
Please find the corrected records below:
(2008,1,3,4,0617,0615,0652,0650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA)
(2008,1,3,4,0628,0620,0804,0750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA)
How can I convert YYYY-MM-DD hh:mm:ss format to integer in python?
for example 2014-02-12 20:51:14 -> to integer.
I only know how to convert hh:mm:ss but not yyyy-mm-dd hh:mm:ss
def time_to_num(time_str):
    hh, mm, ss = map(int, time_str.split(':'))
    return ss + 60*(mm + 60*hh)
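It depends on what the integer is supposed to encode. You could convert the date to the number of seconds (or milliseconds) elapsed since some reference time. People often anchor this to 12:00 am on January 1, 1970 (the Unix epoch), or 1900, etc., and measure time as an integer count from that point. The datetime module (or others like it) has functions that do this for you: for example, int(datetime.datetime.now(datetime.timezone.utc).timestamp()) gives the current time as whole seconds since the epoch. (Avoid calling .timestamp() on datetime.datetime.utcnow(): it returns a naive datetime, which .timestamp() interprets as local time.)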
If you want to semantically encode the year, month, and day, one way to do it is to multiply those components by order-of-magnitude values large enough to juxtapose them within the integer digits:
2012-06-13 --> 20120613 = 10,000 * (2012) + 100 * (6) + 1*(13)
def to_integer(dt_time):
    return 10000*dt_time.year + 100*dt_time.month + dt_time.day
E.g.
In [1]: import datetime
In [2]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:def to_integer(dt_time):
:    return 10000*dt_time.year + 100*dt_time.month + dt_time.day
:    # Or take the appropriate chars from a string date representation.
:--
In [3]: to_integer(datetime.date(2012, 6, 13))
Out[3]: 20120613
If you also want minutes and seconds, then just include further orders of magnitude as needed to display the digits.
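As a minimal sketch of that extension, the hypothetical helpers below (the names are not from any library) pack hours, minutes, and seconds into the lower digits and parse the integer back. It assumes four-digit years, since the reverse direction relies on the fixed-width `%Y%m%d%H%M%S` layout.

```python
from datetime import datetime


def to_integer_with_time(dt):
    """Pack a datetime into a YYYYMMDDhhmmss integer."""
    return (dt.year * 10**10 + dt.month * 10**8 + dt.day * 10**6
            + dt.hour * 10**4 + dt.minute * 10**2 + dt.second)


def from_integer_with_time(value):
    """Parse a YYYYMMDDhhmmss integer back into a datetime."""
    return datetime.strptime(str(value), "%Y%m%d%H%M%S")


stamp = to_integer_with_time(datetime(2014, 2, 12, 20, 51, 14))
print(stamp)                          # 20140212205114
print(from_integer_with_time(stamp))  # 2014-02-12 20:51:14
```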
I've encountered this second method very often in legacy systems, especially systems that pull date-based data out of legacy SQL databases.
It is very bad. You end up writing a lot of hacky code for aligning dates, computing month or day offsets as they would appear in the integer format (e.g. resetting the month back to 1 as you pass December, then incrementing the year value), and boiler plate for converting to and from the integer format all over.
Unless such a convention lives in a deep, low-level, and thoroughly tested section of the API you're working on, such that everyone who ever consumes the data really can count on this integer representation and all of its helper functions, then you end up with lots of people re-writing basic date-handling routines all over the place.
It's generally much better to leave the value in a date context, like datetime.date, for as long as you possibly can, so that the operations upon it are expressed in a natural, date-based context, and not some lone developer's personal hack into an integer.
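A small illustration of the difference, under the assumption that the integer is only needed at the very end (for output or storage): arithmetic on the packed integer produces an invalid date at a month or year boundary, while arithmetic on the date object rolls over correctly. `next_day_as_integer` is a hypothetical helper name.

```python
from datetime import date, timedelta


def to_integer(d):
    """Pack a date into a YYYYMMDD integer."""
    return 10000 * d.year + 100 * d.month + d.day


def next_day_as_integer(d):
    """Do the arithmetic on the date, convert to an integer only at the end."""
    return to_integer(d + timedelta(days=1))


last_day = date(2012, 12, 31)
print(to_integer(last_day) + 1)       # 20121232, not a valid date
print(next_day_as_integer(last_day))  # 20130101
```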
I think I have a shortcut for that:
# Importing datetime.
from datetime import datetime
# Creating a datetime object so we can test.
a = datetime.now()
# Converting a to string in the desired format (YYYYMMDD) using strftime
# and then to int.
a = int(a.strftime('%Y%m%d'))
This is an example that can be used, for instance, to generate a database key; I sometimes use it instead of the AUTOINCREMENT option.
import datetime
dt = datetime.datetime.now()
seq = int(dt.strftime("%Y%m%d%H%M%S"))
The other answers focused on a human-readable representation with int(mydate.strftime("%Y%m%d%H%M%S")). But this makes you lose a lot, including normal integer semantics and arithmetics, therefore I would prefer something like bash date's "seconds since the epoch (1970-01-01 UTC)".
As a reference, you could use the following bash command to get 1392234674 as a result:
date +%s --date="2014-02-12 20:51:14"
As ely hinted in the accepted answer, a plain number representation is unambiguous and by far easier to handle and parse, especially programmatically. Plus, conversion to and from a human-readable form is an easy one-liner both ways.
To do the same thing in python, you can use datetime.timestamp() as djvg commented. For other methods you can consider the edit history.
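A minimal sketch of that approach for the string in the question, assuming the string is meant as UTC (the question does not say which time zone applies). `to_epoch_seconds` and `from_epoch_seconds` are hypothetical names. Note that bash `date` interprets the string in the machine's local time zone, so its result differs from the UTC value by the local offset.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M:%S"


def to_epoch_seconds(text, fmt=FMT):
    """Parse a date string as UTC and return whole seconds since 1970-01-01."""
    naive = datetime.strptime(text, fmt)
    # Attach UTC explicitly; a naive datetime would be treated as local time.
    return int(naive.replace(tzinfo=timezone.utc).timestamp())


def from_epoch_seconds(seconds, fmt=FMT):
    """Format epoch seconds back into a UTC date string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(fmt)


print(to_epoch_seconds("2014-02-12 20:51:14"))  # 1392238274
print(from_epoch_seconds(1392238274))           # 2014-02-12 20:51:14
```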
Here is a simple date -> seconds conversion tool:
from datetime import date
def time_to_int(dateobj):
    total = int(dateobj.strftime('%S'))
    total += int(dateobj.strftime('%M')) * 60
    total += int(dateobj.strftime('%H')) * 60 * 60
    total += (int(dateobj.strftime('%j')) - 1) * 60 * 60 * 24
    # Whole days from 1970-01-01 to January 1 of this year. Date subtraction
    # (rather than 365 * years) accounts for leap years.
    total += (date(dateobj.year, 1, 1) - date(1970, 1, 1)).days * 60 * 60 * 24
    return total
(Effectively a UNIX timestamp calculator that treats the naive datetime as UTC)
Example use:
from datetime import datetime
x = datetime(1970, 1, 1)
time_to_int(x)
Output: 0
x = datetime(2021, 12, 31)
time_to_int(x)
Output: 1640908800
x = datetime(2022, 1, 1)
time_to_int(x)
Output: 1640995200
x = datetime(2022, 1, 2)
time_to_int(x)
Output: 1641081600
When converting a datetime to an integer, keep the decimal place values (tens, hundreds, thousands, ...) in mind. For example,
"2018-11-03" should become 20181103 as an int,
which you compute as
2018*10000 + 100*11 + 3
Similarly, as another example,
"2018-11-03 10:02:05" should become 20181103100205 as an int.
Explanatory Code
from datetime import datetime
dt = datetime(2018,11,3,10,2,5)
print (dt)
#print (dt.timestamp()) # unix representation ... not useful when converting to int
print (dt.strftime("%Y-%m-%d"))
print (dt.year*10000 + dt.month* 100 + dt.day)
print (int(dt.strftime("%Y%m%d")))
print (dt.strftime("%Y-%m-%d %H:%M:%S"))
print (dt.year*10000000000 + dt.month* 100000000 +dt.day * 1000000 + dt.hour*10000 + dt.minute*100 + dt.second)
print (int(dt.strftime("%Y%m%d%H%M%S")))
General Function
To avoid doing this manually, use the function below:
def datetime_to_int(dt):
    return int(dt.strftime("%Y%m%d%H%M%S"))
df.Date = df.Date.str.replace('-', '').astype(int)
Is there a good method to convert a string representing time in the format of [m|h|d|s|w] (m=minutes, h=hours, d=days, s=seconds, w=weeks) to a number of seconds? I.e.
def convert_to_seconds(timeduration):
    ...

convert_to_seconds("1h")
-> 3600
convert_to_seconds("1d")
-> 86400
etc?
Thanks!
Yes, there is a good simple method that you can use in most languages without having to read the manual for a datetime library. This method can also be extrapolated to ounces/pounds/tons etc etc:
seconds_per_unit = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}
def convert_to_seconds(s):
    return int(s[:-1]) * seconds_per_unit[s[-1]]
I recommend using the timedelta class from the datetime module:
from datetime import timedelta
UNITS = {"s":"seconds", "m":"minutes", "h":"hours", "d":"days", "w":"weeks"}
def convert_to_seconds(s):
    count = int(s[:-1])
    unit = UNITS[ s[-1] ]
    td = timedelta(**{unit: count})
    return td.seconds + 60 * 60 * 24 * td.days
Internally, timedelta objects store everything as days, seconds, and microseconds. So while you can give it parameters in units like milliseconds, minutes, hours, or weeks (but not months or years, which timedelta does not accept), in the end you'll have to take the timedelta you created and convert back to seconds.
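The normalization can be seen directly by inspecting a timedelta built from mixed units. The sketch below uses only documented `datetime.timedelta` attributes; `stored_parts` is a hypothetical helper name. `total_seconds()` is shown as an alternative to adding `td.seconds` and `td.days` by hand.

```python
from datetime import timedelta


def stored_parts(td):
    """Return the three fields a timedelta actually stores."""
    return td.days, td.seconds, td.microseconds


td = timedelta(weeks=1, hours=1, milliseconds=1500)
print(stored_parts(td))    # (7, 3601, 500000)
print(td.total_seconds())  # 608401.5

# The manual formula and total_seconds() agree once fractions are dropped.
print(td.seconds + 60 * 60 * 24 * td.days == int(td.total_seconds()))  # True
```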
In case the ** syntax confuses you, it's Python's keyword-argument unpacking syntax. Basically, these function calls are all equivalent:
def f(x, y): pass
f(5, 6)
f(x=5, y=6)
f(y=6, x=5)
d = {"x": 5, "y": 6}
f(**d)
And another to add to the mix.
This solution is brief, but fairly tolerant, and allows for multiples, such as 10m 30s
from datetime import timedelta
import re
UNITS = {'s':'seconds', 'm':'minutes', 'h':'hours', 'd':'days', 'w':'weeks'}
def convert_to_seconds(s):
    return int(timedelta(**{
        UNITS.get(m.group('unit').lower(), 'seconds'): float(m.group('val'))
        for m in re.finditer(r'(?P<val>\d+(\.\d+)?)(?P<unit>[smhdw]?)', s, flags=re.I)
    }).total_seconds())
Test results:
>>> convert_to_seconds('10s')
10
>>> convert_to_seconds('1') # defaults to seconds
1
>>> convert_to_seconds('1m 10s') # chaining
70
>>> convert_to_seconds('1M10S') # case insensitive
70
>>> convert_to_seconds('1week 3days') # ignores 'eek' and 'ays'
864000
>>> convert_to_seconds('This will take 1.25min, probably.') # floats
75
not perfect
>>> convert_to_seconds('1month 3days') # actually 1minute + 3 days
259260
>>> convert_to_seconds('40s 10s') # 1st value clobbered by 2nd
10
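The clobbering in the last case comes from building a dict keyed by unit, so a repeated unit overwrites the earlier value. One possible variation, sketched here with the same regular expression and a hypothetical function name `convert_to_seconds_summed`, adds up one timedelta per match instead. The 'month' quirk remains, since only the first letter after the number is read.

```python
import re
from datetime import timedelta

UNITS = {'s': 'seconds', 'm': 'minutes', 'h': 'hours', 'd': 'days', 'w': 'weeks'}
TOKEN = re.compile(r'(?P<val>\d+(\.\d+)?)(?P<unit>[smhdw]?)', flags=re.I)


def convert_to_seconds_summed(s):
    """Like convert_to_seconds above, but repeated units are added together."""
    total = timedelta()
    for m in TOKEN.finditer(s):
        unit = UNITS.get(m.group('unit').lower(), 'seconds')  # no unit -> seconds
        total += timedelta(**{unit: float(m.group('val'))})
    return int(total.total_seconds())


print(convert_to_seconds_summed('40s 10s'))  # 50
print(convert_to_seconds_summed('1m 10s'))   # 70
print(convert_to_seconds_summed('1'))        # 1
```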
I usually need to support raw numbers, string numbers and string numbers ending in [m|h|d|s|w].
This version will handle: 10, "10", "10s", "10m", "10h", "10d", "10w".
Hat tip to @Eli Courtwright's answer on the string conversion.
from datetime import timedelta

UNITS = {"s":"seconds", "m":"minutes", "h":"hours", "d":"days", "w":"weeks"}

def convert_to_seconds(s):
    if isinstance(s, int):
        # We are dealing with a raw number
        return s
    try:
        seconds = int(s)
        # We are dealing with an integer string
        return seconds
    except ValueError:
        # We are dealing with some other string or type
        pass
    # Expecting a string ending in [m|h|d|s|w]
    count = int(s[:-1])
    unit = UNITS[ s[-1] ]
    td = timedelta(**{unit: count})
    return td.seconds + 60 * 60 * 24 * td.days
I wrote an open-source library, MgntUtils, in Java (not PHP) that partly answers this requirement. It contains a static method, parsingStringToTimeInterval(String value), which parses a string that is expected to hold a time interval value: a numeric value with an optional time unit suffix. For example, the string "38s" is parsed as 38 seconds, "24m" as 24 minutes, "4h" as 4 hours, "3d" as 3 days, and "45" as 45 milliseconds. Supported suffixes are "s" for seconds, "m" for minutes, "h" for hours, and "d" for days. A string without a suffix is taken to hold a value in milliseconds. Suffixes are case insensitive. If the provided string contains an unsupported suffix, or holds a negative, zero, or non-numeric value, an IllegalArgumentException is thrown. The method returns a TimeInterval, a class also defined in this library. Essentially, it holds two properties with the relevant getters and setters: a long "value" and a java.util.concurrent.TimeUnit. In addition to the getters and setters, the class has the methods toMillis(), toSeconds(), toMinutes(), toHours(), and toDays(). Those methods return a long value in the specified time scale (the same way as the corresponding methods in class java.util.concurrent.TimeUnit).
This method may be very useful for parsing time interval properties such as timeouts or waiting periods from configuration files. It eliminates unneeded calculations from different time scales to milliseconds back and forth. Consider that you have a methodInvokingInterval property that you need to set for 5 days. So in order to set the milliseconds value you will need to calculate that 5 days is 432000000 milliseconds (obviously not an impossible task but annoying and error prone) and then anyone else who sees the value 432000000 will have to calculate it back to 5 days which is frustrating. But using this method you will have a property value set to "5d" and invoking the code
long seconds = TextUtils.parsingStringToTimeInterval("5d").toSeconds();
will solve your conversion problem. Obviously, this is not overly complex feature, but it could add simplicity and clarity in your configuration files and save some frustration and "stupid" miscalculation into milliseconds bugs. Here is the link to the article that describes the MgntUtils library as well as where to get it: MgntUtils