My working environment is pandas (Python). I have a CSV file whose contents I have loaded into Python:
from tkinter import filedialog
import pandas as pd

file_path = filedialog.askopenfilename()
csv_file = open(file_path, 'r')
df = pd.read_csv(csv_file)
After this code runs, I can display the file's contents in pandas as a table.
Now I want to decode the data in one particular column, "Batch". This column is the important one and is what needs to be decoded. The values under the Batch column are structured as follows:
First character: year
Second character: a letter mapped to a month (A = Jan, B = Feb, C = Mar, D = Apr, E = May, ...)
Third and fourth characters: day of the month
Example: the manufacturing date for 6B08MK1D11 is 08-02-2016.
I want to decode every value in the column to find its date from its batch number, and then create a new column holding the extracted dates. For example, after decoding "6B08MK1D11" I get the date 08-02-2016; each batch number yields its own date, and these dates should be placed in a new column inside the same table.
After the new column is created, it should be sorted in ascending order.
I tried to map the month letters like this:
for everycode[1] in Bat:
    if everycode[1] == 'A':
        everycode[1] = 'Jan'
    if everycode[1] == 'B':
        everycode[1] = 'Feb'
    if everycode[1] == 'C':
        everycode[1] = 'Mar'
    if everycode[1] == 'D':
        everycode[1] = 'Apr'
    if everycode[1] == 'E':
        everycode[1] = 'May'
    if everycode[1] == 'F':
        everycode[1] = 'Jun'
    if everycode[1] == 'G':
        everycode[1] = 'Jul'
    if everycode[1] == 'H':
        everycode[1] = 'Aug'
    if everycode[1] == 'I':
        everycode[1] = 'Sep'
    if everycode[1] == 'J':
        everycode[1] = 'Oct'
    if everycode[1] == 'K':
        everycode[1] = 'Nov'
    if everycode[1] == 'L':
        everycode[1] = 'Dec'
But when I execute this, it raises the following error:
TypeError: 'str' object does not support item assignment
You could try something like this:
def interpret_date(code):
    _year = replace_year(code[0])
    _month = replace_month(code[1])
    _date = replace_date(code[2:4])
    return '-'.join([_date, _month, _year])

df = pd.read_csv(csv_file)
df['Batch'] = df['Batch'].apply(interpret_date)
You will need to write the replace_...() functions to map each input to the right values.
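If it helps, here is a minimal sketch of those pieces put together, assuming the year digit refers to the 2010s (so '6' means 2016) and that you want the result in a new column rather than overwriting Batch; the column names 'Batch' and 'Date' follow the question:

def interpret_date(code):
    year = '201' + code[0]                              # assumes years in the 2010s
    month = str(ord(code[1]) - ord('A') + 1).zfill(2)   # A -> 01, B -> 02, ... L -> 12
    day = code[2:4]
    return '-'.join([day, month, year])

df['Date'] = df['Batch'].apply(interpret_date)          # e.g. '6B08MK1D11' -> '08-02-2016'
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df = df.sort_values('Date')                             # sort the new column ascending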
Related
I have an existing loop that I use to go through a large number of file paths, which ultimately sends the files through a cloud processing pipeline. I need to update the loop to match each file name with a dataframe column (fileName), then get the associated value from a second column (date) and store it as a variable inside the loop.
# dataframe that I need to extract 'date' from
df = pd.DataFrame({'id': ['dat1', 'dat2', 'dat3'],
                   'date': [2019, 2021, 2015],
                   'fileName': ['dat1.file', 'dat2.file', 'dat3.file']})

# list of file paths whose file names I need to match with my dataframe
gs_files = ['path/dat1.file', 'path/dat2.file']
bucket = 'path/'

for f in gs_files:
    # get file path
    print('Path: ', f)
    # get file name (need to keep this for later processing steps)
    fbname = f.replace(bucket, '')
    print('Image name: ', fbname)
    # match fbname with df['fileName'] and store the associated 'date' as a separate variable (not as a column in df)
    if fbname in df['fileName']:
        year = df['date']
        print('Collection date: ', year)
    # Extra processing steps will be executed below.
# Resulting output from the above code:
Path: path/dat1.file
Image name: dat1.file
Path: path/dat2.file
Image name: dat2.file
# Desired output:
Path: path/dat1.file
Image name: dat1.file
Collection date: 2019
Path: path/dat2.file
Image name: dat2.file
Collection date: 2021
Change this code:
if fbname in df['fileName']:
    year = df['date']
    print('Collection date: ', year)
to this:
if df['fileName'].isin([fbname]).any():
    year = df['date'][df['fileName'] == fbname].iloc[0]
    print('Collection date: ', year)
fbname in df['fileName'] doesn't do what you expect, because the in operator on a Series checks the index labels, not the values. Instead, df['fileName'].isin([fbname]) returns a boolean Series that is True for each item of the original column that appears in the list you pass ([fbname]) and False otherwise. Then, .any() returns True if there is at least one True in the Series it's called on.
Also, df['date'][df['fileName'] == fbname] selects the items from date where fileName equals fbname, and .iloc[0] pulls the actual value out.
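If the lookup happens many times, one small variation (just a sketch, reusing the names from the question and assuming the fileName values are unique) is to build a plain dict once and index into it inside the loop:

# build the lookup once: fileName -> date
date_by_name = df.set_index('fileName')['date'].to_dict()

for f in gs_files:
    fbname = f.replace(bucket, '')
    if fbname in date_by_name:
        year = date_by_name[fbname]
        print('Collection date: ', year)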
I'm trying to read and analyse data from a molecular dynamics simulation. It looks like this, but has approximately 50,000 lines:
40 443.217134221125 -1167.16960983145 -930.540717277902 -945.149746592058 14.6090293141563 -76510.1177229871 4955.17798368798 17.0485096390963 17.0485096390963 17.0485096390963
80 659.39103652059 -923.638916369481 -963.088128935875 -984.822539088925 21.7344101530497 14390.2520385682 4392.18167603894 16.3767140226773 16.3767140226773 16.3767140226773
120 410.282687399253 -979.413482414461 -978.270613122515 -991.794079036891 13.5234659143754 -416.30808174241 4398.37322990079 16.3844056974088 16.3844056974088 16.3844056974088
The second column represents temperature. I want to have the entire contents of the file inside a list, containing lists dividing every line depending on their temperature. So for example, the first list in the main list would have every line where the temperature is 50+/-25K, the second list in the main list would have every line where the temperature is 100+/-25K, the third for 150+/-25K, etc.
Here's the code I have so far:
for nbligne in tqdm(range(0, len(LogFullText), 1), unit=" lignes", disable=False):
    string = LogFullText[nbligne]
    line = string.replace('\n', '')
    Values = line.split(' ')
    divider = float(Values[1])
    number = int(round(divider / ecart, 0))
    if number > 0 and number < (nbpts + 1):
        numericValues = []
        for nbresultat in range(0, len(Values) - 1, 1):
            numericValues = numericValues + [float(Values[nbresultat + 1])]
        TotalResultats[number - 1].append(numericValues)
The entire document is stored in the list LogFullText. For each line I remove the trailing \n and split it with line.split(' '); the variable number then tells me in which "section" of the main list, TotalResultats, that line of data has to be stored (ecart is 50 in my example).
From my testing in IDLE this should work, but in practice the list numericValues gets appended to every section of TotalResultats, which makes the whole "sorting" process pointless, since I simply end up with nbpts copies of the same list.
EDIT: A desired output would be, for example, to have TotalResultats[0] contain only these lines:
440 49.9911561170447 -1002.727121613 -1002.72088094757 -1004.36865629012 1.64777534254374 -2.30045369926927 4346.38067015602 16.319590369315 16.319590369315 16.319590369315
480 42.0678318129411 -1002.69068695093 -1003.09270361295 -1004.47931559314 1.38661198019398 148.219667654185 4345.58826561836 16.3185985476593 16.3185985476593 16.3185985476593
520 43.0855216044083 -1003.4761833678 -1003.33820025832 -1004.75835665467 1.42015639634654 -50.877194096845 4345.23364199522 16.3181546401367 16.3181546401367 16.3181546401367
Whereas TotalResultats[1] would contain these:
29480 109.504432929553 -980.560226069922 -998.958927113452 -1002.5683396275 3.6094125140473 6797.60091557441 4336.52501942717 16.3072458525354 16.3072458525354 16.3072458525354
29520 106.663291994583 -987.853629557979 -998.63436605413 -1002.15013076443 3.51576471029626 3975.43407740646 4344.84444478408 16.3176674266037 16.3176674266037 16.3176674266037
29560 112.712019757891 -1020.65735849343 -998.342638324154 -1002.05777718853 3.71513886437272 -8172.25412368794 4374.81748831773 16.3551041162317 16.3551041162317 16.3551041162317
And TotalResultats[2] would be:
52480 142.86322849701 -983.254970494784 -995.977110177167 -1000.68607319299 4.70896301582636 4687.60299340191 4348.30194824999 16.321994657312 16.321994657312 16.321994657312
52520 159.953459288754 -984.221801201968 -995.711657311665 -1000.9839371836 5.27227987193358 4233.04866428826 4348.82254074761 16.3226460049712 16.3226460049712 16.3226460049712
52560 161.624843851124 -1011.76969126636 -995.320907086768 -1000.64827802848 5.32737094170867 -6023.57133443538 4375.12133631739 16.3554827492176 16.3554827492176 16.3554827492176
In the first case,
TotalResultats[0][0] = [49.9911561170447, -1002.727121613, -1002.72088094757, -1004.36865629012, 1.64777534254374, -2.30045369926927, 4346.38067015602, 16.319590369315, 16.319590369315, 16.319590369315]
If it helps, I'm coding this in Visual Studio, using Python 3.6.8.
Thanks a whole lot!
I recommend using pandas. It's a very powerful tool for working with tabular data in Python, a bit like Excel or SQL inside Python. Suppose 1.csv contains the data you have provided in the question. Then you can easily load the data, filter it, and save the results:
import pandas as pd
# load data from file into pandas dataframe
df = pd.read_csv('1.csv', header=None, delimiter=' ')
# filter by temperature, column named 0 since there is no header in the file
df2 = df[df[0].between(450, 550)]
# save filtered rows in the same format
df2.to_csv('2.csv', header=None, index=False, sep=' ')
Pandas may be harder to learn than plain python syntax but it is well worth it.
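To get the grouping into 50 K bins that the question describes (rather than a single filtered range), a rough sketch along the same lines could look like this; it assumes the same file, and drops the first column of each stored row so the result matches the shape of TotalResultats[0][0] shown above:

import pandas as pd

ecart = 50  # bin width in K, as in the question
df = pd.read_csv('1.csv', header=None, sep=r'\s+')

# assign each row to a temperature bin centred on 50 K, 100 K, 150 K, ...
bins = (df[1] / ecart).round().astype(int)

# one list of rows per bin, keeping every column except the first
TotalResultats = [group.iloc[:, 1:].values.tolist() for _, group in df.groupby(bins)]

Note that bins containing no rows are simply absent from this list, whereas the original code pre-allocates nbpts empty sections.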
I have two columns, one with values that represent a time and another with values that represent a date (both stored as floats). I have the following data in each column:
df['Time']
540.0
630.0
915.0
1730.0
2245.0
df['Date']
14202.0
14202.0
14203.0
14203.0
I need to create new columns with the correct data format for these two columns, to be able to analyze data with date and time in distinct columns.
For ['Time'] I need to convert the format so that:
540.0 = 5h40 (or 5.40 am)
2245.0 = 22h45 (or 10.45 pm)
For ['Date'], each number represents a count of days, where 0 days = 01-01-1980. So adding 14202.0 days to 01-01-1980 gives 18-11-2018, and adding 14203.0 days gives 19-11-2018. This is easy to do in Excel, but I need a way to do it in Python.
I tried different pieces of code but nothing works; for example, one attempt is below:
# create a variable from the ['Date'] column, adding the days to the date:
Time1 = pd.to_datetime(df["Date"])

# printing shows that 14203 in row 55384 is appended to the end of the created date,
# including a time, which is not what I want:
print(Time1.loc[[55384]])
55384   1970-01-01 00:00:00.000014203
Name: Date, dtype: datetime64[ns]

# printing the same row (55384) to check the value 14203.0 that was used above:
print(df["Date"].loc[[55384]])
55384    14203.0
Name: Date, dtype: float64
For ['Time'] I have the same problem: I can't get a time without a date. I also tried to insert ':', but it doesn't work even after converting the data type to string.
I hope someone can help me with this; if anything is unclear please let me know, as it is not easy to explain.
Regarding the time conversion:
# change to integer
tt = [int(i) for i in df['Time']]
# convert to time
time_ = pd.to_datetime(tt, format='%H%M').time
# convert from 24-hour to 12-hour time format
[t.strftime("%I:%M %p") for t in time_]
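As an alternative sketch that stays inside pandas (assuming df['Time'] holds floats like 540.0 with no missing values), zero-padding the values first makes the %H%M parse unambiguous; 'Time12' is just a placeholder name for the new column:

# 540.0 -> '0540' -> '05:40 AM'; 2245.0 -> '2245' -> '10:45 PM'
t = df['Time'].astype(int).astype(str).str.zfill(4)
df['Time12'] = pd.to_datetime(t, format='%H%M').dt.strftime('%I:%M %p')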
Solving problems with Date
from datetime import datetime
from datetime import timedelta

startdate_string = "1980/01/01"  # define the start date as a string
startdate_object = datetime.strptime(startdate_string, "%Y/%m/%d").date()  # convert the string to a date object
print(startdate_object)  # check the date

Creating a list that will become a new date-formatted column in the dataframe:

import math

datenew = []
dates = df['UTS_Date']  # data from the original column 'UTS_Date'
for values in dates:
    # accept null values and append them to the new list as-is
    if math.isnan(values):
        datenew.append('NaN')
        continue
    # add the value in each row as a day offset to the reference date (startdate_object)
    currentdate1 = startdate_object + timedelta(days=float(values))
    # convert the date to a plain string before appending, so the list holds text rather than datetime.date objects
    datenew.append(str(currentdate1))

print(len(datenew))  # check the length of datenew to make sure every row of the data made it into the list
df.insert(3, 'Date', datenew)  # create a new column in the dataframe for the formatted date
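For what it's worth, the whole date conversion above can also be done in one vectorized call. This is just a sketch using the same 'UTS_Date' column and a hypothetical 'Date2' output column; missing values come back as NaT rather than the string 'NaN':

df['Date2'] = pd.to_datetime(df['UTS_Date'], unit='D', origin='1980-01-01')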
Solving problems with Time
timenew = []  # create a new list
times = df['Time']  # the df['Time'] column of the dataframe

# variable to find the location of times that are >= 2400
i = 0

def Normalize_time(val):
    offset = 0
    if val >= 2400:
        offset = 1
    # convert val to an integer to drop the decimal places
    hours = int(val / 100)
    # remove the hours, keeping only the minutes
    minutes = int(val) - hours * 100
    # wrap any rows that are at or above 24h
    hours = (hours % 23) - offset
    # zfill pads hours and minutes with zeros on the left
    # until they reach two characters each
    return str(hours).zfill(2) + ':' + str(minutes).zfill(2)

Creating a for statement to add all the values to the new list, using Normalize_time():

for values in times:
    # accept null values and append them to the new list as-is
    if math.isnan(values):
        timenew.append('NaN')
        continue
    # pass the value through Normalize_time()
    timestr = Normalize_time(values)
    # append each converted value to the new list
    timenew.append(timestr)

print(len(timenew))  # check the length of timenew to make sure every row of the data made it into the list
df.insert(4, 'ODTime', timenew)  # create a new column in the dataframe
I am trying to extract the Start Station from a csv file, example data below.
Start Time,End Time,Trip Duration,Start Station,End Station,User Type,Gender,Birth Year
1423854,2017-06-23 15:09:32,2017-06-23 15:14:53,321,Wood St & Hubbard St,Damen Ave & Chicago Ave,Subscriber,Male,1992.0
The problem is that when I try to extract the data I receive the following error message:
AttributeError: 'Series' object has no attribute 'start'
def load_data(city, month, day):
    # load data file into a dataframe
    df = pd.read_csv(CITY_DATA[city])
I believe my problem stems from converting the Start Station, but can't seem to figure why.
# convert the Start Station column to dataframe
df['Start Station'] = pd.DataFrame(df['Start Station'])
# extract street names from Start Station and End Station to create new columns
df['start'] = df['Start Station'].start
def station_stats(df):
    """Displays statistics on the most popular stations and trip."""
    # TO DO: display most commonly used start station
    popular_start_station = df['start']
    print(popular_start_station)
Your code is confusing. Just try this:
df = pd.read_csv(CITY_DATA[city])  # load the data file into one dataframe
start_data_series = df['Start Station']  # create a Series with the column of interest
You can select more columns in the second line if you need them.
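To cover the TO DO in station_stats (displaying the most commonly used start station), one simple sketch is to take the mode of the column directly, without creating a 'start' column first:

popular_start_station = df['Start Station'].mode()[0]  # most frequent value in the column
print('Most common start station:', popular_start_station)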
I created a function that parses a filename into its constituent parts, including camera information and a time stamp. I want to apply this function (only the time stamp is relevant to me, so that is what I return) to a column of a CSV that contains the filename.
        Exx, mean                 filename
0        1.14E-33  cam0_006806_418.852.csv
1        4.54E-05  cam0_006807_418.910.csv
2        4.48E-05  cam0_006808_418.975.csv
3     0.000138274  cam0_006809_419.037.csv
4     0.000118886  cam0_006810_419.097.csv
5     0.001155703  cam0_006811_419.157.csv
I want to add the parsed time as a fourth column. This is what I have so far:
import os
import re
import pandas as pd

def csvdecode(f):
    s = os.path.basename(f)
    pattern = "".join([r'cam(?P<cam_id>[0-9]+)_',
                       r'(?P<frame_id>[0-9]+)_',
                       r'(?P<time>[0-9]+.[0-9]+)'])
    m = re.search(pattern, s)
    d = {'Camera ID': m.group('cam_id'),
         'Frame ID': m.group('frame_id'),
         'Timestamp (s)': float(m.group('time'))}
    # return only the "time" portion of the timestamp
    return d['Timestamp (s)']

df = pd.read_csv('results_avg_optical_strain.csv')
df['Time (s)'] = df['filename'].apply(csvdecode)
and it runs with no errors but nothing is added to the existing csv. Any help is appreciated, thanks!
To add a column to a DataFrame loaded from a .csv with pandas, all you need to do is
df['column name'] = 'something'
This adds the column to the DataFrame in memory and broadcasts the value so that its length matches the length of the other columns. Note, however, that this does not change the .csv file on disk by itself, which is why your code runs without errors but nothing appears to be added to the file.
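A minimal sketch of the missing step, writing the updated frame back out (this reuses the file name from the question; overwrite with care, or write to a new file):

df.to_csv('results_avg_optical_strain.csv', index=False)  # persist the new 'Time (s)' column to disk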