Extract data from string in Python

I have a .csv file with rows of the form [datetime, "(data1, data2)"] and I have managed to import the data into Python as time and temp. The problem I am facing is: how do I separate the temp string into two new_temp columns in float format to use for plotting later on?
My code so far is:
import csv
import matplotlib.dates as dates

def getColumn(filename, column):
    results = csv.reader(open(filename), delimiter=",")
    return [result[column] for result in results]

time = getColumn("logfile.csv", 0)
temp = getColumn("logfile.csv", 1)
new_time = dates.datestr2num(time)
new_temp = [???]
When I print temp I get ['(0.0, 0.0)', '(64.4164, 66.2503)', '(63.4768, 65.4108)', '(62.7148, 64.6278)', '(62.0408, 63.9625)', '(61.456, 63.2638)', '(61.0234, 62.837)', '(60.6823, 62.317)',...etc]
If anyone can help me then thanks in advance.

You may use this code:
import re
string = "['(0.0, 0.0)', '(64.4164, 66.2503)', '(63.4768, 65.4108)', '(62.7148, 64.6278)', '(62.0408, 63.9625)', '(61.456, 63.2638)', '(61.0234, 62.837)', '(60.6823, 62.317)']"
data = re.findall(r'[+-]?\d+\.\d+e?[+-]?\d*', string)
# list() so the pairs can be iterated twice below
data = list(zip(data[0::2], data[1::2]))
print([float(d[0]) for d in data])
print([float(d[1]) for d in data])

Going from the answer in
Parse a tuple from a string?
from ast import literal_eval as make_tuple
temp = ['(0.0, 0.0)', '(64.4164, 66.2503)', '(63.4768, 65.4108)', '(62.7148, 64.6278)', '(62.0408, 63.9625)', '(61.456, 63.2638)', '(61.0234, 62.837)', '(60.6823, 62.317)']
tuples = [make_tuple(stringtuple) for stringtuple in temp]
and you have a list of tuples of floats.
I undeleted the post and made it a full answer, because apparently it wasn't clear enough to reference the other post.
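To get the two separate float columns the question asks for, the list of tuples can then be unpacked with zip; a minimal sketch along those lines (new_temp1/new_temp2 are just illustrative names):
from ast import literal_eval as make_tuple

temp = ['(0.0, 0.0)', '(64.4164, 66.2503)', '(63.4768, 65.4108)']
tuples = [make_tuple(s) for s in temp]
# transpose the list of pairs into two float lists for plotting
new_temp1, new_temp2 = (list(t) for t in zip(*tuples))
print(new_temp1)  # [0.0, 64.4164, 63.4768]
print(new_temp2)  # [0.0, 66.2503, 65.4108]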

Related

Read in a specific way from a csv file with pandas python

I have data in a csv file; here is a sample:
firstnb,secondnb,distance
901,19011,459.73618164837535
901,19017,492.5540450352788
901,19018,458.489289271722
903,13019,167.46632044684435
903,13020,353.16001204909657
the desired output:
901,19011,19017,19018
903,13019,13020
As you can see in the output, I want to take each value of the firstnb column (901/903) and put beside it the matching secondnb values. I believe you can understand from the desired output better than from my explanation :D
What I tried so far is the following:
import pandas as pd
import csv
df = pd.read_csv('test.csv')
with open('neighborList.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    secondStation = []
    for row in range(len(df)):
        firstStation = df['firstnb'][row]
        for x in range(len(df)):
            if firstStation == df['firstnb'][x]:
                secondStation.append(df['secondnb'][x])
        # line = firstStation, secondStation
        # writer.writerow(line)
        print(firstStation, secondStation)
        secondStation = []
my code outputs this:
901 [19011, 19017, 19018]
901 [19011, 19017, 19018]
901 [19011, 19017, 19018]
903 [13019, 13020]
903 [13019, 13020]
Pandas has a built-in function to do this, called groupby:
df = pd.read_csv(YOUR_CSV_FILE)
df_grouped = list(df.groupby(df['firstnb'])) # group by first column
# chain keys and values into merged list
for key, values in df_grouped:
    print([key] + values['secondnb'].tolist())
Here I just print the sublists; you can save them into a new csv in any format you'd like (strings, ints, etc.).
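If the goal is the neighborList.csv file from the question rather than printed lists, the grouped rows can be written out directly; a minimal sketch combining groupby with the csv module (file and column names as in the question):
import csv
import pandas as pd

df = pd.read_csv('test.csv')
with open('neighborList.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    for key, values in df.groupby('firstnb'):
        # one row per group, e.g. 901,19011,19017,19018
        writer.writerow([key] + values['secondnb'].tolist())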
First, I grouped the data by firstnb, creating a list of the values in secondnb using the aggregate function.
df[['firstnb','secondnb']].groupby('firstnb').aggregate(func=list).to_dict()
By turning this into a dict, we get:
{'secondnb': {901: [19011, 19017, 19018], 903: [13019, 13020]}}
I'm not entirely clear on what the final output should be (plain strings, lists, …), but from here on, it's easy to produce whatever you'd like.
For example, a list of lists:
intermediate = df[['firstnb','secondnb']].groupby('firstnb').aggregate(func=list).to_dict()
[[k] + v for k,v in intermediate['secondnb'].items()]
Result:
[[901, 19011, 19017, 19018], [903, 13019, 13020]]
def toList(a):
    res = []
    for r in a:
        res.append(r)
    return res

df.groupby('firstnb').agg(toList)
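For what it's worth, toList just rebuilds its argument as a list, so passing the built-in list should behave the same here:
df.groupby('firstnb').agg(list)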

TypeError: can't convert type 'NoneType' to numerator/denominator

Here I try to calculate a mean value based on the data in two lists of dicts. Although I used the same code before, I keep getting this error. Is there any solution?
import pandas as pd
data = pd.read_csv('data3.csv',sep=';') # Reading data from csv
data = data.dropna(axis=0) # Drop rows with null values
data = data.T.to_dict().values() # Converting dataframe into list of dictionaries
newdata = pd.read_csv('newdata.csv',sep=';') # Reading data from csv
newdata = newdata.T.to_dict().values() # Converting dataframe into list of dictionaries
score = []
for item in newdata:
    score.append({item['Genre_Name']: item['Ranking']})
from statistics import mean
score={k:int(v) for i in score for k,v in i.items()}
for item in data:
    y = mean(map(score.get, map(str.strip, item['Recommended_Genres'].split(','))))
    print(y)
Too see csv files: https://repl.it/#rmakakgn/SVE2
The .get method of dict returns None if the given key does not exist, and statistics.mean fails because of that. Consider:
import statistics
d = {"a":1,"c":3}
data = [d.get(x) for x in ("a","b","c")]
print(statistics.mean(data))
which results in:
TypeError: can't convert type 'NoneType' to numerator/denominator
You need to remove the Nones before feeding the data into statistics.mean, which you can do with a list comprehension:
import statistics
d = {"a":1,"c":3}
data = [d.get(x) for x in ("a","b","c")]
data = [i for i in data if i is not None]
print(statistics.mean(data))
or filter:
import statistics
d = {"a":1,"c":3}
data = [d.get(x) for x in ("a","b","c")]
data = filter(lambda x:x is not None,data)
print(statistics.mean(data))
(both snippets above will print 2)
In this particular case, you might get the filter effect by replacing:
mean(map(score.get,map(str.strip,item['Recommended_Genres'].split(','))))
with:
mean([i for i in map(score.get,map(str.strip,item['Recommended_Genres'].split(','))) if i is not None])
though, as with most Python built-in and standard library functions that accept a list as their sole argument, you might decide not to build a list at all but to feed a generator expression directly, i.e.
mean(i for i in map(score.get,map(str.strip,item['Recommended_Genres'].split(','))) if i is not None)
For further discussion see PEP 202 and PEP 289.

Pandas extract information [duplicate]

This question already has answers here:
Selection with .loc in python
(5 answers)
Closed 4 years ago.
This is my Pandas Frame:
istat cap Comune
0 1001 10011 AGLIE'
1 1002 10060 AIRASCA
2 1003 10070 ALA DI STURA
I want to reproduce the equivalent SQL query:
SELECT cap
FROM DataFrame
WHERE Comune = 'AIRASCA'
Obtaining:
cap
10060
I tried to achieve this with dataframe.loc() but I cannot retrieve what I need.
And this is my Python code:
import pandas as pd
from lxml import etree
from pykml import parser
def to_upper(l):
    return l.upper()

kml_file_path = '../Source/Kml_Regions/Lombardia.kml'
excel_file_path = '../Source/Milk_Coverage/Milk_Milan_Coverage.xlsx'
zip_file_path = '../Source/ZipCodes/italy_cap.csv'
# Read zipcode csv
zips = pd.read_csv(zip_file_path)
zip_df = pd.DataFrame(zips, columns=['cap', 'Comune']).set_index('Comune')
zips_dict = zips.apply(lambda x: x.astype(str).str.upper())
# Read excel file for coverage
df = pd.ExcelFile(excel_file_path).parse('Comuni')
x = df['City'].tolist()
cities = list(map(to_upper, x))
#-----------------------------------------------------------------------------------------------#
# Check uncovered
# parse the input file into an object tree
with open(kml_file_path) as f:
    tree = parser.parse(f)
# get a reference to the "Document.Folder" node
uncovered = tree.getroot().Document.Folder
# iterate through all "Document.Folder.Placemark" nodes and remove those
# whose child node "name" is in the covered cities list
for pm in uncovered.Placemark:
    if pm.name in cities:
        parent = pm.getparent()
        parent.remove(pm)
    else:
        pass
# convert the object tree into a string and write it into an output file
with open('../Output/Uncovered_Milkman_LO.kml', 'w') as output:
    output.write(etree.tostring(uncovered, pretty_print=True))
#---------------------------------------------------------------------------------------------#
# Check covered
with open(kml_file_path) as f:
    tree = parser.parse(f)
covered = tree.getroot().Document.Folder
for pmC in covered.Placemark:
    if pmC.name in cities:
        pass
    else:
        parentCovered = pmC.getparent()
        parentCovered.remove(pmC)
# convert the object tree into a string and write it into an output file
with open('../Output/Covered_Milkman_LO.kml', 'w') as outputs:
    outputs.write(etree.tostring(covered, pretty_print=True))
# Writing CAP
with open('../Output/Covered_Milkman_LO.kml', 'r') as f:
    in_file = f.readlines()  # in_file is now a list of lines
# Now we start building our output
out_file = []
cap = ''
#for line in in_file:
#    out_file.append(line)  # copy each line, one by one
# Iterate through the placemarks, looking up each covered city
for city in covered.Placemark:
    print zips_dict.loc[city.name, ['Comune']]
I cannot understand the errors Python is giving me; what am I doing wrong? Technically I can look up a key by finding a value in pandas, is that correct?
I think it is not similar to the possible duplicate question because I'm asking to retrieve a single value instead of a column.
ksooklall's answer should work just fine, but (unless I'm remembering incorrectly) it's a bit of a faux pas to use back-to-back brackets in pandas: it's a bit slower than using loc, which can actually matter when making many calls on larger dataframes.
Using loc like this should work just fine:
df.loc[df['Comune'] == 'AIRASCA', 'cap']
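Since the question asks for a single value rather than a one-element Series, it may be worth adding (a small sketch, assuming exactly one matching row):
cap = df.loc[df['Comune'] == 'AIRASCA', 'cap'].iloc[0]  # scalar: 10060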
Try this:
cap = df[df['Comune'] == 'AIRASCA']['cap']
You can use eq:
Ex:
import pandas as pd
df = pd.DataFrame({"istat": [1001, 1002, 1003], "cap": [10011, 10060, 10070 ], "Comune": ['AGLIE', 'AIRASCA', 'ALA DI STURA']})
print( df.loc[df["Comune"].eq('AIRASCA'), "cap"] )
Output:
1 10060
Name: cap, dtype: int64

Storing Datetime in a matrix to be used to define points of interest (Python)

I have a bunch of CSV files that contain rows of dates corresponding to data, with column headers. Using pandas, I have been able to import the CSV files. Now, I made a CSV file that labels the points of interest by datetime, and I have also used pandas to import this file. I need to store the start time and end time in a matrix/array/something to call later, to parse with my data which is labeled with these dates. Currently, using pd.to_datetime I have been able to convert the strings in my CSVs to datetime, but I have no idea how to store this. This is my third day using Python, so I apologize for the newbie question. I am a relatively advanced user of Matlab. I will provide my code, but I will not be able to provide the data in question as it is not owned by me. Thanks guys!
NUMBER_OF_CLASSES = 4
SUBSPACE_DIMENSION = 3
from datetime import datetime
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import matplotlib.pyplot as plt
import scipy.io as sio
PeriodList = pd.read_csv('IP_List.csv')
PeriodList = PeriodList.as_matrix()
# Pdata format:
# Pdata{hull, engine, 1}(:) - datetime array of hull and engine P data
# Pdata{hull, engine, 2}(:,:) - parametric data corrsponding to timestamps in datetime array
# Pdata{hull, engine, 3}(:) - array of parametric channel labels
Pdata_1 = pd.read_csv('LPD-17_1A.csv')
[list_m, list_n] = PeriodList.shape
Pdata_1 = Pdata_1.as_matrix()
startdatetime = []
enddatetime = []
#Up to line 27 done on MatLab script
for d in range(0, list_m):
    Hull = PeriodList[d,0]
    Engine = PeriodList[d,1]
    startdatetime[d] = pd.to_datetime(PeriodList[d,2])
    enddatetime[d] = pd.to_datetime(PeriodList[d,3])
    #startdatetime = pd.to_datetime(PeriodList[d,2])
Instead of iterating through the dataframe, you can store the start and end dates in a new dataframe, convert the columns to time series, and then access the data with the iloc method:
dates = PeriodList[['START','END']]
dates['START'] = pd.to_datetime(dates['START'])
dates['END'] = pd.to_datetime(dates['END'])
# You can access the dates based on index using iloc
dates.iloc[3]
# If you want the start date you can use the column name
dates.iloc[3]['START']
In case you want to store it specifically under an existing data structure, you can use a dictionary with the index as keys and the dataframe values as values:
start_end = dict(zip(dates.index, dates.values))
If you are looking for the difference between the end date and the start date, you can simply subtract the columns, i.e.
dates['Difference'] = dates['END']-dates['START']
I suggest you go through the pandas documentation for more info about accessing the data here.
Edit :
You can also use a dictionary in your code, i.e.
startdatetime = {}
enddatetime = {}
#Up to line 27 done on MatLab script
for d in range(0, list_m):
    Hull = PeriodList[d,0]
    Engine = PeriodList[d,1]
    startdatetime[d] = pd.to_datetime(PeriodList[d,2])
    enddatetime[d] = pd.to_datetime(PeriodList[d,3])
Hope this helps
Figured out a solution: make lists of empty strings, so the loop stores the value on each iteration. Since it is an empty string, there will not be a "cannot convert to float" error. Thanks for the help @Bharath Shetty
Code:
PeriodList = pd.read_csv('IP_List.csv')
PeriodList = PeriodList.as_matrix()
# Pdata format:
# Pdata{hull, engine, 1}(:) - datetime array of hull and engine P data
# Pdata{hull, engine, 2}(:,:) - parametric data corrsponding to timestamps in datetime array
# Pdata{hull, engine, 3}(:) - array of parametric channel labels
Pdata_1 = pd.read_csv('LPD-17_1A.csv')
[list_m, list_n] = PeriodList.shape
#Pdata_1 = Pdata_1.as_matrix()
startdatetime = ['' for x in range(list_m)]
enddatetime = ['' for x in range(list_m)]
#Up to line 27 done on MatLab script
for d in range(0, list_m):
    Hull = PeriodList[d,0]
    Engine = PeriodList[d,1]
    startdatetime[d] = pd.to_datetime(PeriodList[d,2])
    enddatetime[d] = pd.to_datetime(PeriodList[d,3])
#startdatetime = pd.to_datetime(PeriodList[d,2])
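For completeness, the loop can also be avoided entirely, since pd.to_datetime accepts whole columns; a minimal sketch, assuming the start and end times are the third and fourth columns of IP_List.csv:
import pandas as pd

PeriodList = pd.read_csv('IP_List.csv')
# vectorized conversion of the whole columns at once (column positions assumed)
startdatetime = pd.to_datetime(PeriodList.iloc[:, 2])
enddatetime = pd.to_datetime(PeriodList.iloc[:, 3])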

How to convert an array of dates (format 'mm/dd/yy HH:MM:SS') to numerics?

I have recently (1 week ago) decided to migrate my work to Python from matlab. Since I am used to matlab, I sometimes find it difficult to get the exact equivalent of what I want to do in python.
Here's my problem:
I have a set of csv files that I want to process. So far, I have succeeded in loading them into groups. Each column has a size of more than 600000 x 1. One of the columns in the csv file is the time, which has the format 'mm/dd/yy HH:MM:SS'. I want to convert the time column to numbers and I am using date2num from matplotlib for that. Is there a 'matrix' way of doing it? The command in matlab for doing that is datenum(time, 'mm/dd/yyyy HH:MM:SS'), where time is a 600000 x 1 matrix.
Thanks
Here is an example of the code that I am talking about:
import csv
import time
import datetime
from datetime import date
from matplotlib.dates import date2num

time = []
otherColumns = []
for d in csv.DictReader(open('MyFile.csv')):
    time.append(str(d['time']))
    otherColumns.append(float(d['otherColumns']))
timeNumeric = date2num(datetime.datetime.strptime(time, "%d/%m/%y %H:%M:%S"))
you could use a generator:
def pre_process(dict_sequence):
    for d in dict_sequence:
        d['time'] = date2num(datetime.datetime.strptime(d['time'], "%d/%m/%y %H:%M:%S"))
        yield d
now you can process your csv:
for d in pre_process(csv.DictReader(open('MyFile.csv'))):
    process(d)
the advantage of this solution is that it doesn't copy sequences that are potentially large.
Edit:
So you want the contents of the file in a numpy array?
import numpy
reader = csv.DictReader(open('MyFile.csv'))
# you might want to get rid of the intermediate list if the file is really big.
data = numpy.array(list(d.values() for d in pre_process(reader)))
Now you have a nice big array that allows all kinds of operations. You want only the first column to get your 600000x1 matrix:
data[:,0] # assuming time is the first column
The closest thing in Python to matlab's matrix/vector operations is a list comprehension. If you would like to apply a Python function to each item in a list you could do:
new_list = [date2num(data) for data in old_list]
or
new_list = list(map(date2num, old_list))  # list() needed on Python 3, where map returns an iterator
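Since the data comes from csv files anyway, a vectorized route through pandas may be the closest thing to matlab's datenum on a whole column; a minimal sketch, assuming a column named 'time' (hypothetical name) in the file:
import pandas as pd
from matplotlib.dates import date2num

df = pd.read_csv('MyFile.csv')
# parse the entire column in one call, then convert to matplotlib date numbers
times = pd.to_datetime(df['time'], format='%m/%d/%y %H:%M:%S')
timeNumeric = date2num(times.dt.to_pydatetime())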
