How do I sort the following CSV file by date, from newest to oldest? The dates are inconsistently formatted; I know I can reformat them, but what methods can handle both formats as they are?
IDN,NAME,Gender,DOJ,JOB ID,SALARY
100,Alpha Fenn,M,17-06-2003,AD_PRES,24000
101,Axpire Ced,F,2-9-2005,AD_VP,17000
102,Winston Cor,M,13-01-2001,AD_VP,17000
103,Relv Dest,M,3/1/2006,IT_PROG,9000
Is there any way to sort the whole CSV file by the DOJ date column?
Sorted_Data = sorted(csv.reader(open('Empl.csv')), key=lambda x: datetime.strptime(x[3], "%d/%m/%Y"), reverse=True)
The above code does work, but only if the dates are consistently formatted, and it only sorts on that one column.
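For reference, I imagine a key function that tries each known format in turn could handle both separators - a rough sketch (the format list is a guess from the sample):
import csv
from datetime import datetime

def parse_doj(value):
    # Try each candidate day-first format until one matches
    for fmt in ("%d-%m-%Y", "%d/%m/%Y"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognised date: " + value)

with open('Empl.csv') as f:
    reader = csv.reader(f)
    header = next(reader)  # keep the header row out of the sort
    Sorted_Data = sorted(reader, key=lambda row: parse_doj(row[3]), reverse=True)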
After sorting it should look like this:
IDN,NAME,Gender,DOJ,JOB ID,SALARY
103,Relv Dest,M,3/1/2006,IT_PROG,9000
101,Axpire Ced,F,2-9-2005,AD_VP,17000
100,Alpha Fenn,M,17-06-2003,AD_PRES,24000
102,Winston Cor,M,13-01-2001,AD_VP,17000
Use pandas and sort_values
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO("""IDN,NAME,Gender,DOJ,JOB ID,SALARY
100,Alpha Fenn,M,17-06-2003,AD_PRES,24000
101,Axpire Ced,F,2-9-2005,AD_VP,17000
102,Winston Cor,M,13-01-2001,AD_VP,17000
103,Relv Dest,M,3/1/2006,IT_PROG,9000"""))
# Or if you have it in a csv file then use
# df = pd.read_csv('file_name.csv')
df['DOJ'] = pd.to_datetime(df['DOJ'])
df.sort_values(by=['DOJ'], ascending=False, inplace=True)
df.to_csv()  # with no path argument, to_csv returns the CSV as a string
Output:
',IDN,NAME,Gender,DOJ,JOB ID,SALARY\n
3,103,Relv Dest,M,2006-03-01,IT_PROG,9000\n
1,101,Axpire Ced,F,2005-02-09,AD_VP,17000\n
0,100,Alpha Fenn,M,2003-06-17,AD_PRES,24000\n
2,102,Winston Cor,M,2001-01-13,AD_VP,17000\n'
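One caveat: pd.to_datetime guesses month-first by default, which is why 2-9-2005 came out as 2005-02-09 above. If you know the dates are day-first, pass dayfirst=True (a small tweak, assuming day-first data):
# Treat ambiguous dates such as 2-9-2005 as day-first (2 September 2005)
df['DOJ'] = pd.to_datetime(df['DOJ'], dayfirst=True)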
I have a data set sample below to be processed with Python or Scala:
FWD,13032009:09:01,10.56| FWD,13032009:10:53,11.23| FWD,13032009:15:40,23.20
SPOT,13032009:09:04,11.56| FWD,13032009:11:45,11.23| SPOT,13032009:12:30,23.20
FWD,13032009:08:01,10.56| SPOT,13032009:12:30,11.23| FWD,13032009:13:20,23.20| FWD,13032009:14:340,56.00
FWD,13032009:08:01,10.56| SPOT,13032009:12:30,11.23| FWD,13032009:13:20,23.20
Every line is to be split into multiple smaller strings that can be split further.
What I am looking for is an efficient way to generate an RDD or DataFrame with the content below:
FWD,13032009:09:01,10.56
FWD,13032009:10:53,11.23
FWD,13032009:15:40,23.20
SPOT,13032009:09:04,11.56
FWD,13032009:11:45,11.23
SPOT,13032009:12:30,23.20
FWD,13032009:08:01,10.56
SPOT,13032009:12:30,11.23
FWD,13032009:13:20,23.20
FWD,13032009:14:340,56.00
FWD,13032009:08:01,10.56
SPOT,13032009:12:30,11.23
FWD,13032009:13:20,23.20
Note: the more efficient the better, as the total row count in production could be in the millions.
Thank you very much.
Assuming you are reading from a CSV file, you can read each line into a list, flatten the values, and then process them as individual rows.
Read the file into a list - 1 million rows should not be too much to handle:
import csv
import itertools
import pandas as pd
with open('test.csv', 'r') as f:
    reader = csv.reader(f, delimiter='|')
    rows = list(reader)
Flatten and split from a single list - the excellent itertools module in Python's standard library returns a generator, which helps with memory and is efficient.
flat_rows = itertools.chain.from_iterable(rows)
list_rows = [i.strip().split(',') for i in flat_rows]
The nested list, list_rows, now gives you a clean and formatted list which you can send to pandas if you want to create a DataFrame.
list_rows
>>
[['FWD', '13032009:09:01', '10.56'],
['FWD', '13032009:10:53', '11.23'],
['FWD', '13032009:15:40', '23.20'],
['SPOT', '13032009:09:04', '11.56'],
['FWD', '13032009:11:45', '11.23'],
['SPOT', '13032009:12:30', '23.20'],
['FWD', '13032009:08:01', '10.56'],
['SPOT', '13032009:12:30', '11.23'],
['FWD', '13032009:13:20', '23.20'],
['FWD', '13032009:14:340', '56.00'],
['FWD', '13032009:08:01', '10.56'],
['SPOT', '13032009:12:30', '11.23'],
['FWD', '13032009:13:20', '23.20']]
df = pd.DataFrame(list_rows)
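If you want named columns on that DataFrame, you can pass them at construction (the names here are just a guess at what the fields mean):
# Hypothetical column names; rename to whatever the fields actually are
df = pd.DataFrame(list_rows, columns=['type', 'timestamp', 'value'])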
Python solution: If you get the text as a string, you can replace() the "| " separators with newlines (\n) and then read the result as a DataFrame:
import pandas as pd
from io import StringIO
data_set = """FWD,13032009:09:01,10.56| FWD,13032009:10:53,11.23| FWD,13032009:15:40,23.20
SPOT,13032009:09:04,11.56| FWD,13032009:11:45,11.23| SPOT,13032009:12:30,23.20
FWD,13032009:08:01,10.56| SPOT,13032009:12:30,11.23| FWD,13032009:13:20,23.20| FWD,13032009:14:340,56.00
FWD,13032009:08:01,10.56| SPOT,13032009:12:30,11.23| FWD,13032009:13:20,23.20
"""
data_set *= 100000 # Make it over a million elements to ensure performance is adequate
data_set = data_set.replace("| ", "\n")
data_set_stream = StringIO(data_set) # Pandas needs to read a file-like object, so need to turn our string into a buffer
df = pd.read_csv(data_set_stream, header=None)  # header=None, since the data has no header row
print(df) # df is our desired DataFrame
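If the data is big enough that you want Spark but would rather stay in Python, roughly the same flatMap idea as the Scala answer below works in PySpark (a sketch, assuming a SparkSession named spark and an input file data.txt):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Split every line on '|', strip the stray spaces, then let Spark
# parse the flattened records as CSV rows.
lines = spark.sparkContext.textFile('data.txt')
records = lines.flatMap(lambda line: [p.strip() for p in line.split('|')])
df = spark.read.csv(records)
df.show(truncate=False)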
Here is the Scala way, if you are interested:
val rdd1 = sc.parallelize(List(
  "FWD,13032009:09:01,10.56| FWD,13032009:10:53,11.23| FWD,13032009:15:40,23.20",
  "SPOT,13032009:09:04,11.56| FWD,13032009:11:45,11.23| SPOT,13032009:12:30,23.20",
  "FWD,13032009:08:01,10.56| SPOT,13032009:12:30,11.23| FWD,13032009:13:20,23.20| FWD,13032009:14:340,56.00",
  "FWD,13032009:08:01,10.56| SPOT,13032009:12:30,11.23| FWD,13032009:13:20,23.20"))
val rdd2 = rdd1.flatMap(l => l.replaceAll(" ","").split("\\|"))
val rds = rdd2.toDS // needs import spark.implicits._ outside the spark-shell
val df = spark.read.csv(rds)
df.show(false)
+----+---------------+-----+
|_c0 |_c1 |_c2 |
+----+---------------+-----+
|FWD |13032009:09:01 |10.56|
|FWD |13032009:10:53 |11.23|
|FWD |13032009:15:40 |23.20|
|SPOT|13032009:09:04 |11.56|
|FWD |13032009:11:45 |11.23|
|SPOT|13032009:12:30 |23.20|
|FWD |13032009:08:01 |10.56|
|SPOT|13032009:12:30 |11.23|
|FWD |13032009:13:20 |23.20|
|FWD |13032009:14:340|56.00|
|FWD |13032009:08:01 |10.56|
|SPOT|13032009:12:30 |11.23|
|FWD |13032009:13:20 |23.20|
+----+---------------+-----+
In a dataframe, I have a column "UnixTime" and want to convert it to a new column containing the UTC time.
import pandas as pd
from datetime import datetime
df = pd.DataFrame([1565691196, 1565691297, 1565691398], columns = ["UnixTime"])
unix_list = df["UnixTime"].tolist()
utc_list = []
for i in unix_list:
    i = datetime.utcfromtimestamp(i).strftime('%Y-%m-%d %H:%M:%S')
    utc_list.append(i)
df["UTC"] = utc_list
This works, but I guess there is a smarter approach?
Could you try this:
df["UTC"] = pd.to_datetime(df['UnixTime'], unit='s')
If by smarter approach you mean the pandas way with less code, then this is your answer:
df["UTC"] = pd.to_datetime(df["UnixTime"], unit = "s")
Hope this helps.
I need to polish a CSV dataset, but it seems the changes are not applied to the dataset itself.
CSV is in this format:
ID, TRACK_LINK
761607, https://mylink.com//track/...
This is my script:
import pandas as pd
df = pd.read_csv('./file.csv').fillna('')
# remove double // from TRACK_LINK
def polish_track_link(track_link):
    return track_link.replace("//track", "/track")
df['TRACK_LINK'].apply(polish_track_link)
print(df)
This prints something like:
...
761607 https://mylink.com//track/...
note the //track
If I do print(df['TRACK_LINK'].apply(polish_track_link)) I get:
...
761607 https://mylink.com/track/...
So the function polish_track_link works but it's not applied to the dataset. Any idea why?
You need to assign the result back:
df['TRACK_LINK'] = df['TRACK_LINK'].apply(polish_track_link)
But it is better to use the pandas string method str.replace, or replace with regex=True, to replace substrings:
df['TRACK_LINK'] = df['TRACK_LINK'].str.replace("//track", "/track")
Or:
df['TRACK_LINK'] = df['TRACK_LINK'].replace("//track", "/track", regex=True)
print(df)
ID TRACK_LINK
0 761607 https://mylink.com/track/
I have a pandas DataFrame with a column like this:
from pandas import DataFrame
df = DataFrame({'column_name': [u'Monday,30 December,2013', u'Delivered', u'19:23', u'1']})
Now I want to extract everything from it and store it in three columns like this:
date status time
[30/December/2013] ['Delivered'] [19:23]
So far I have used this:
import dateutil.parser as dparser
dparser.parse([u'Monday,30 December,2013', u'Delivered', u'19:23', u'1'])
but this throws an error. Can anyone please guide me to a solution?
You can apply() a function to a column; see the whole example:
from pandas import DataFrame
df = DataFrame({'date': ['Monday,30 December,2013'], 'delivery': ['Delivered'], 'time': ['19:23'], 'status':['1']})
# delete the status column
del df['status']
def splitter(val):
    parts = val.split(',')
    return parts[1]
df['date'] = df['date'].apply(splitter)
This yields a DataFrame with the date, delivery, and time columns.
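Going a bit further, if every record arrives as four stacked values as in the question, one way to get the desired three columns is to regroup before splitting (a sketch; the regrouping and column names are assumptions about how the data arrives):
from pandas import DataFrame

df = DataFrame({'column_name': [u'Monday,30 December,2013', u'Delivered', u'19:23', u'1']})

# Treat each block of four stacked values as one record: date, status, time, flag
values = df['column_name'].tolist()
records = [values[i:i + 4] for i in range(0, len(values), 4)]
out = DataFrame(records, columns=['date', 'status', 'time', 'flag'])

# Drop the weekday prefix, keeping '30 December,2013'
out['date'] = out['date'].apply(lambda v: ','.join(v.split(',')[1:]))
out = out.drop(columns=['flag'])
print(out)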