I have a list containing values with measurement units and I want to remove them. The original list is shown below:
['Dawn:', 'Sunrise:', 'Moonrise:', 'Dusk:', 'Sunset:\xa0', 'Moonset:', 'Daylight:', 'Length:', 'Phase:', 'Temperature', 'Dew\xa0Point ', 'Windchill', 'Humidity', 'Heat Index', 'Apparent Temperature', 'Solar Radiation', 'Evapotranspiration Today', 'Rainfall\xa0Today', 'Rainfall\xa0Rate', 'Rainfall\xa0This\xa0Month', 'Rainfall\xa0This\xa0Year', 'Rainfall\xa0Last Hour', 'Last rainfall', 'Wind\xa0Speed\xa0(gust)', 'Wind\xa0Speed\xa0(avg)', 'Wind Bearing', 'Beaufort\xa0F1', 'Barometer\xa0', 'Rising slowly']
['07:30', '08:04', '17:03', '19:05', '18:31', '01:45', '11:35', '10:27', 'Waxing Gibbous', '13.7\xa0°C', '11.4\xa0°C', '13.7\xa0°C', '86%', '13.7\xa0°C', '13.0\xa0°C', '0\xa0W/m²', '0.15\xa0mm', '0.0\xa0mm', '0.0\xa0mm/hr', '36.4\xa0mm', '36.4\xa0mm', '0.0\xa0mm', '2018-10-14 08:52', '6.1\xa0kts', '2.6\xa0kts', '229° SW', 'Light air', '1026.89\xa0mb', '0.27\xa0mb/hr']
To remove measurement units like degrees, kts, mb etc. I follow the approach below:
newlist = [word for line in test for word in line.split()]
#print(newlist)
testlist = ['°C', 'W/m²', 'mm','mm/hr', 'mb','kts', 'mb/hr', '%']
t = [x for x in newlist for d in testlist if d in x]
s = [r for r in newlist if r not in testlist]
After this code I am able to remove all the units, but values that contain spaces, like 'Waxing Gibbous', end up split into separate items. Is it possible to join them back with spaces?
Result of code:
['Dawn:', 'Sunrise:', 'Moonrise:', 'Dusk:', 'Sunset:\xa0', 'Moonset:', 'Daylight:', 'Length:', 'Phase:', 'Temperature', 'Dew\xa0Point ', 'Windchill', 'Humidity', 'Heat Index', 'Apparent Temperature', 'Solar Radiation', 'Evapotranspiration Today', 'Rainfall\xa0Today', 'Rainfall\xa0Rate', 'Rainfall\xa0This\xa0Month', 'Rainfall\xa0This\xa0Year', 'Rainfall\xa0Last Hour', 'Last rainfall', 'Wind\xa0Speed\xa0(gust)', 'Wind\xa0Speed\xa0(avg)', 'Wind Bearing', 'Beaufort\xa0F1', 'Barometer\xa0', 'Rising slowly']
['07:30', '08:04', '17:03', '19:05', '18:31', '01:45', '11:35', '10:27', 'Waxing', 'Gibbous', '13.7', '11.4', '13.7', '86%', '13.7', '13.0', '0', '0.15', '0.0', '0.0', '36.4', '36.4', '0.0', '2018-10-14', '08:52', '5.2', '2.4', '188°', 'S', 'Light', 'air', '1026.21', '0.23']
Main source code from which the data is fetched:
Data origin source code
Any help would be appreciated, thanks
Your source data, as identified earlier, comes from a dict called grouped (if you could put that back into the question and show an example, that would be great).
From grouped you want to get all the keys as headers and all the values as values, while stripping out the unit symbols you do not need.
The code below does that for you, starting from your grouped dict, and stores your headers and values in two separate lists:
headers = []
values = []
testlist = ['°C', 'W/m²', 'mm', 'mm/hr', 'mb', 'kts', 'mb/hr']

# collect keys and values from the grouped data (a[0] holds the grouped data from the question)
for i in a[0]:
    for k, v in i.items():
        headers.append(k)
        values.append(v)

# strip every unit symbol from the values
for idx, v in enumerate(values):
    for t in testlist:
        values[idx] = values[idx].replace(t, '')

for h, v in zip(headers, values):
    print('Header: {} , Value : {}'.format(h, v))
In the future it definitely helps if you outline where your source data begins and what your expected output is.
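If you only have the two flat lists from the question (rather than the grouped dict), a minimal sketch of the same idea would be to strip the unit from each value instead of splitting on whitespace, so multi-word values like 'Waxing Gibbous' stay in one piece (values below stands for the second list in the question):
units = ['W/m²', 'mm/hr', 'mb/hr', '°C', 'mm', 'mb', 'kts', '%']  # longer units first so 'mm/hr' is removed before 'mm'
cleaned = []
for value in values:
    value = value.replace('\xa0', ' ')  # normalise non-breaking spaces
    for unit in units:
        value = value.replace(unit, '')
    cleaned.append(value.strip())
print(cleaned)  # ['07:30', ..., 'Waxing Gibbous', '13.7', ..., '6.1', ...]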
Related
I have a dataframe:
business049.txt [bmw, cash, fuel, mini, product, less, mini]
business470.txt [saudi, investor, pick, savoy, london, famou]
business075.txt [eu, minist, mull, jet, fuel, tax, european]
business101.txt [australia, rate, australia, rais, benchmark]
business060.txt [insur, boss, plead, guilti, anoth, us, insur]
I would like the output to have a column of words and a column of the filenames that contain each word. It should look like:
bmw [business049.txt,business055.txt]
australia [business101.txt,business141.txt]
Thank you
This is quite possibly not the most efficient/best way to do this, but here you go:
import pandas as pd

# Create DataFrame from question
df = pd.DataFrame({
    'txt_file': ['business049.txt',
                 'business470.txt',
                 'business075.txt',
                 'business101.txt',
                 'business060.txt',
                 ],
    'words': [
        ['bmw', 'cash', 'fuel', 'mini', 'product', 'less', 'mini'],
        ['saudi', 'investor', 'pick', 'savoy', 'london', 'famou'],
        ['eu', 'minist', 'mull', 'jet', 'fuel', 'tax', 'european'],
        ['australia', 'rate', 'australia', 'rais', 'benchmark'],
        ['insur', 'boss', 'plead', 'guilti', 'anoth', 'us', 'insur'],
    ]
})
# Get all unique words in a list
word_list = list(set(df['words'].explode()))

# Link txt files to unique words
# Note: the list of txt files is one comma-separated string to ensure a single column in the resulting DataFrame
word_dict = {
    unique_word: [', '.join(df[df['words'].apply(lambda list_of_words: unique_word in list_of_words)]['txt_file'])]
    for unique_word in word_list
}

# Create DataFrame from dictionary (transpose to have words as row index).
words_in_files = pd.DataFrame(word_dict).transpose()
The dictionary word_dict might already be exactly what you need, instead of holding on to a DataFrame just for the sake of using a DataFrame. If that is the case, remove the ', '.join() part from the dictionary creation, because it then doesn't matter that the values of your dict are unequal in length.
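For completeness, a hedged alternative sketch that builds the same word-to-files mapping with explode and groupby (this assumes pandas >= 0.25, where explode was introduced):
# Explode the word lists into one row per (file, word) pair, then group the file names by word
exploded = df.explode('words')
word_to_files = exploded.groupby('words')['txt_file'].apply(lambda files: sorted(set(files)))
print(word_to_files.loc['fuel'])  # ['business049.txt', 'business075.txt']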
I have a file and I want to append some rows to an empty list if they meet 2 conditions:
I only take the rows whose country_code is also present in my_countrycodes, AND
for each country_code I take the MAX date-time, if that date-time is < my_time1.
Please note that the country_code of each row is indexed at [1] in the file and the datetime of each row is a variable named date_time4.
This is my code:
import csv
import datetime

my_time = '2020-09-06 16:00:45'
my_time1 = datetime.datetime.strptime(my_time, '%Y-%m-%d %H:%M:%S')
my_countrycodes = ['555', '256', '1000']
all_row_times = []  # <--- this is the list where we will append the datetime values of the file
new_list = []  # <--- this is the final list where we will append our results

with open(root, 'r') as out:
    reader = csv.reader(out, delimiter='\t')
    for row in reader:
        # print(row)
        date_time1 = row[-2] + row[-1]  # <--- concatenate date + time
        date_time2 = datetime.datetime.strptime(date_time1, '%d-%m-%Y%H:%M:%S')  # <--- make a datetime object of the string
        date_time3 = datetime.datetime.strftime(date_time2, '%Y-%m-%d %H:%M:%S')  # <--- turn the datetime object back into a string
        date_time4 = datetime.datetime.strptime(date_time3, '%Y-%m-%d %H:%M:%S')  # <--- turn the string back into a datetime object
        all_row_times.append(date_time4)  # <--- put all the datetime objects into a list
        if any(country_code in row[1] for country_code in my_countrycodes) and date_time4 == max(dt for dt in all_row_times if dt < my_time1):
            new_list.append(row)  # <-- append the rows whose country_code is in my_countrycodes with the latest time, if that time is < my_time1
print(new_list)
This is what the file looks like:
[screenshot of the tab-separated input file]
This is the output of new_list:
[['USA', '555', 'White', 'True', 'NY', '06-09-2020', '10:11:32'],
['USA', '555', 'White', 'True', 'BS', '06-09-2020', '10:11:32'],
['EU', '256', 'Blue', 'False', 'BR', '06-09-2020', '11:26:21'],
['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '14:51:45'],
['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '15:59:45']]
As you can see, the code extracts the rows with the country codes 555, 256 and 1000, and it also takes only the rows whose date-times are smaller than my_time1. So this part works perfectly. However, country code 1000 has 2 different date-times and I don't understand why it doesn't take just the MAX date-time.
This is the expected output of new_list:
[['USA', '555', 'White', 'True', 'NY', '06-09-2020', '10:11:32'],
['USA', '555', 'White', 'True', 'BS', '06-09-2020', '10:11:32'],
['EU', '256', 'Blue', 'False', 'BR', '06-09-2020', '11:26:21'],
['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '15:59:45']]
Actually, your code does take just the MAX date-time, but in the for loop the 14:51:45 row comes first. At that point the later value has not been read yet, so 14:51:45 is the maximum seen so far and the row is appended.
On a later iteration the 15:59:45 row for the same country code arrives, and since its time is greater than all the others, that row is appended as well. This is what you are missing, I guess.
You may try something like this.
my_time = datetime.datetime.strptime('2020-09-06 16:00:45', '%Y-%m-%d %H:%M:%S')
my_countrycodes = ['555', '256', '1000']

country_code_max_date_rel = {}
matched_rows = []

with open(root, 'r') as out:
    reader = csv.reader(out, delimiter='\t')
    for row in reader:
        date_time = datetime.datetime.strptime(row[-2] + row[-1], '%d-%m-%Y%H:%M:%S')
        if any(country_code in row[1] for country_code in my_countrycodes):
            matched_rows.append(row)
            try:
                # raise KeyError on purpose so a newer (larger) date-time overwrites the stored one
                if country_code_max_date_rel[str(row[1])] < date_time:
                    raise KeyError
            except KeyError:
                country_code_max_date_rel[str(row[1])] = date_time
At this point you have the max date-time for each country code, and also the list of matching rows.
If you then filter again like this:
new_list = []
for row in matched_rows:
    country_code = row[1]
    date_time = datetime.datetime.strptime(row[-2] + row[-1], '%d-%m-%Y%H:%M:%S')
    if date_time == country_code_max_date_rel[country_code]:
        if date_time < my_time:
            new_list.append(row)
The new list:
[['USA', '555', 'White', 'True', 'NY', '06-09-2020', '10:11:32'],
['USA', '555', 'White', 'True', 'BS', '06-09-2020', '10:11:32'],
['EU', '256', 'Blue', 'False', 'BR', '06-09-2020', '11:26:21'],
['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '15:59:45']]
This code is not really good but I guess it will help you to update yours.
Sorry, I'm not sure what you are trying to do here. Assuming that you want only one instance of each country code in your new_list, with the latest time before my_time1, here's an answer:
The logic in your code is incorrect. Right now you are iterating through all rows from your csv file and applying the same condition before appending each new row to new_list.
In the given case ['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '15:59:45'] is added because condition 1 is True (1000 is in my_countrycodes) and condition 2 is also True ('06-09-2020 15:59:45' is smaller than my_time1 and it is also the 'biggest' time seen so far).
You could approach this problem in many different ways, but here are some suggestions:
Change your solution so that it:
checks if row[1] is in str(my_countrycodes),
checks if the row's time is less than my_time1,
checks if the row's country code is already in new_list,
if it is not in new_list, adds it,
if it is in new_list, checks whether the new date and time match your condition and, if so, updates the date and time columns for that row.
Alternatively, filter your file by country codes and then retrieve the max per country code from the filtered result.
Be careful about what your key is, as a country code can repeat itself with different parameters ('NY', 'BS').
Suggestions and comments:
For quick access to data you could use dictionaries. Using the country code as a key would grant you easy access to the data and help you quickly check for its existence and update its parameters (a sketch of this idea follows after these suggestions).
any(country_code in row[1] for country_code in my_countrycodes)
can be written as:
row[1] in str(my_countrycodes)
or you could even create my_country_code_str = str(my_countrycodes) before entering the for loop.
I do not know why you are converting the datetime back and forth; as you only need the final datetime object, it's enough to do this:
rows_date_time = datetime.datetime.strptime(row[-2] + row[-1], '%d-%m-%Y%H:%M:%S')
Remember that you can format it as you like with '%d-%m-%Y%H:%M:%S'.
Remember to give meaningful names to your variables and keep to one coding standard in your code (e.g. when you use underscores, use them consistently).
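A minimal sketch of that dictionary idea, in one pass over the file (this assumes the same layout as in the question: the file path in root, the country code at row[1], and the date and time in the last two columns; it keeps every row that shares the per-country maximum, so both 555 rows survive):
import csv
import datetime

my_time1 = datetime.datetime.strptime('2020-09-06 16:00:45', '%Y-%m-%d %H:%M:%S')
my_countrycodes = {'555', '256', '1000'}

best = {}  # country code -> (max date-time seen so far, rows carrying that date-time)
with open(root, 'r') as out:
    for row in csv.reader(out, delimiter='\t'):
        if row[1] not in my_countrycodes:
            continue
        dt = datetime.datetime.strptime(row[-2] + row[-1], '%d-%m-%Y%H:%M:%S')
        if dt >= my_time1:
            continue  # only keep date-times before my_time1
        if row[1] not in best or dt > best[row[1]][0]:
            best[row[1]] = (dt, [row])
        elif dt == best[row[1]][0]:
            best[row[1]][1].append(row)

new_list = [row for _, rows in best.values() for row in rows]
print(new_list)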
So I have this data in a file, which is presented as:
Commodity, USA, Canada, Europe, China, India, Australia
Wheat,61.7,27.2,133.9,121,94.9,22.9
Rice Milled,6.3, -,2.1,143,105.2,0.8
Oilseeds,93.1,19,28.1,59.8,36.8,5.7
Cotton,17.3, -,1.5,35,28.5,4.6
The top row is the header, and the first column contains headers as well. The dashes represent missing data.
The format of the returned dictionary is as follows:
The keys of the dictionary are the names of the countries.
The values are dictionaries containing the data for each country. The keys of these dictionaries are names of commodities, the values are the quantity produced by that country for a given commodity. If there is no data for the given commodity (that is, a dash in the csv file), the commodity must not be included in the dictionary. For example, cotton must not be in the dictionary for Canada. Note, a '-' (dash) is different than the value 0.
From the file above it should be represented as:
{'Canada': {'Wheat': 27.2, 'Oilseeds': 19}, 'USA': {'Wheat': 61.7, 'Cotton': 17.3, ...}, ...}
I'm confused about where to start or what to do. I've been stuck for days.
If you do not plan to import any modules, this works too
data = {}
with open('data.txt') as f:
    column_dict = {}
    for i, line in enumerate(f):
        vals = line.rstrip().split(',')
        row_heading = vals[0]
        row_data = vals[1:]
        # Add column names as keys and empty dicts as values for the final data
        # Create a header dict to keep track of the index for each column
        if i == 0:
            data = {col.strip(): {} for col in row_data}
            column_dict = {col.strip(): i for i, col in enumerate(row_data)}
        else:
            for x in data.keys():
                # Exclude data with dashes
                if row_data[column_dict[x]].strip() != "-":
                    data[x][row_heading] = row_data[column_dict[x]]
print(data)
If you have no issue with importing the pandas module, then it can be done as follows:
import pandas as pd
df = pd.read_csv('test2.csv', sep=',')
df.set_index('Commodity').to_json()
It will give you the following output:
{" USA":{"Wheat":61.7,"Rice Milled":6.3,"Oilseeds":93.1,"Cotton":17.3}," Canada":{"Wheat":"27.2","Rice Milled":" -","Oilseeds":"19","Cotton":" -"}," Europe":{"Wheat":133.9,"Rice Milled":2.1,"Oilseeds":28.1,"Cotton":1.5}," China":{"Wheat":121.0,"Rice Milled":143.0,"Oilseeds":59.8,"Cotton":35.0}," India":{"Wheat":94.9,"Rice Milled":105.2,"Oilseeds":36.8,"Cotton":28.5}," Australia":{"Wheat":22.9,"Rice Milled":0.8,"Oilseeds":5.7,"Cotton":4.6}}
If you really want it without any imports (for whatever reason), the shortest thing I could come up with is the following:
with open('data_sample.txt') as f:
    lines = f.readlines()

split_lines = [[i.strip() for i in l.split(',')] for l in lines]

d = {}
for i, line in enumerate(zip(*split_lines)):
    if i == 0:
        value_headers = line
        continue
    d[line[0]] = dict([(i, j) for i, j in zip(value_headers[1:], line[1:]) if j != '-'])

print(d)
Out:
{'USA': {'Wheat': '61.7', 'Rice Milled': '6.3', 'Oilseeds': '93.1', 'Cotton': '17.3'}, 'Canada': {'Wheat': '27.2', 'Oilseeds': '19'}, 'Europe': {'Wheat': '133.9', 'Rice Milled': '2.1', 'Oilseeds': '28.1', 'Cotton': '1.5'}, 'China': {'Wheat': '121', 'Rice Milled': '143', 'Oilseeds': '59.8', 'Cotton': '35'}, 'India': {'Wheat': '94.9', 'Rice Milled': '105.2', 'Oilseeds': '36.8', 'Cotton': '28.5'}, 'Australia': {'Wheat': '22.9', 'Rice Milled': '0.8', 'Oilseeds': '5.7', 'Cotton': '4.6'}}
There might be better uses of zip etc., but it should give a general idea.
I have the following results from a vet analyser:
result{type:PT/APTT;error:0;PT:32.3 s;INR:0.0;APTT:119.2;code:470433200;lot:405
4H0401;date:20/01/2017 06:47;PID:TREKKER20;index:015;C1:-0.1;C2:-0.1;qclock:0;ta
rget:2;name:;Sex:;BirthDate:;operatorID:;SN:024000G0900046;version:V2.8.0.09}
Using Python, how do I separate the date, the time, the type, PT and APTT? Please note that the results will be different every time, so I need to write code that will find the date using the / and will get the time from the four digits and the :. Do I use a for loop?
This code makes further usage of the fields easier by converting them to a dict.
from pprint import pprint

result = "result{type:PT/APTT;error:0;PT:32.3 s;INR:0.0;APTT:119.2;code:470433200;lot:405 4H0401;date:20/01/2017 06:47;PID:TREKKER20;index:015;C1:-0.1;C2:-0.1;qclock:0;ta rget:2;name:;Sex:;BirthDate:;operatorID:;SN:024000G0900046;version:V2.8.0.09}"

if result.startswith("result{") and result.endswith("}"):
    result = result[(result.index("{") + 1):result.index("}")]
# else:
#     raise ValueError("Invalid data '" + result + "'")

# Separate fields
fields = result.split(";")

# Separate field names and values
# The first part is the name of the field for sure, but any additional ":" must not be split,
# for example "date:dd/mm/yyyy HH:MM" -> "date": "dd/mm/yyyy HH:MM"
fields = [field.split(":", 1) for field in fields]
fields = {field[0]: field[1] for field in fields}

a = fields['type'].split("/")

print(fields)
pprint(fields)
print(a)
The result:
{'type': 'PT/APTT', 'error': '0', 'PT': '32.3 s', 'INR': '0.0', 'APTT': '119.2', 'code': '470433200', 'lot': '405 4H0401', 'date': '20/01/2017 06:47', 'PID': 'TREKKER20', 'index': '015', 'C1': '-0.1', 'C2': '-0.1', 'qclock': '0', 'ta rget': '2', 'name': '', 'Sex': '', 'BirthDate': '', 'operatorID': '', 'SN': '024000G0900046', 'version': 'V2.8.0.09'}
{'APTT': '119.2',
'BirthDate': '',
'C1': '-0.1',
'C2': '-0.1',
'INR': '0.0',
'PID': 'TREKKER20',
'PT': '32.3 s',
'SN': '024000G0900046',
'Sex': '',
'code': '470433200',
'date': '20/01/2017 06:47',
'error': '0',
'index': '015',
'lot': '405 4H0401',
'name': '',
'operatorID': '',
'qclock': '0',
'ta rget': '2',
'type': 'PT/APTT',
'version': 'V2.8.0.09'}
['PT', 'APTT']
Note that dictionaries are not sorted (they don't need to be in most cases as you access the fields by the keys).
If you want to split the results by semicolon:
result_array = result.split(';')
In result_array you'll get all the strings separated by the semicolons; then you can access the date there: result_array[index]
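A minimal sketch of that idea, if you'd rather look the date entry up by its prefix than by a hard-coded index (this assumes the same result string as in the question):
result_array = result.split(';')
date_entry = next(s for s in result_array if s.startswith('date:'))  # 'date:20/01/2017 06:47'
date, time = date_entry.split(':', 1)[1].split()
print(date, time)  # 20/01/2017 06:47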
That's quite a bad format to store data in, as fields might have colons in their values, but if you have to: you can strip away the surrounding result{...}, split the rest on semicolons, then do a single split on a colon to get dict key-value pairs, and then just build a dict from that, e.g.:
data = "result{type:PT/APTT;error:0;PT:32.3 s;INR:0.0;APTT:119.2;code:470433200;lot:405 " \
"4H0401;date:20/01/2017 06:47;PID:TREKKER20;index:015;C1:-0.1;C2:-0.1;qclock:0;ta " \
"rget:2;name:;Sex:;BirthDate:;operatorID:;SN:024000G0900046;version:V2.8.0.09}"
parsed = dict(e.split(":", 1) for e in data[7:-1].split(";"))
print(parsed["APTT"]) # 119.2
print(parsed["PT"]) # 32.3 s
print(parsed["date"]) # 20/01/2017 06:47
If you need to further separate the date field into a date and a time, you can just do date, time = parsed["date"].split(), although if you're going to manipulate the object I'd suggest you use the datetime module and parse it, e.g.:
import datetime
date = datetime.datetime.strptime(parsed["date"], "%d/%m/%Y %H:%M")
print(date) # 2017-01-20 06:47:00
print(date.year) # 2017
print(date.hour) # 6
# etc.
To go straight to the point and get your type, PT, APTT, date and time, use re:
import re
from source import result_gen

result = result_gen()

def from_result(*vars):
    # re.escape makes sure the field names are treated literally in the pattern
    regex = re.compile('|'.join([f'{re.escape(var)}:.*?;' for var in vars]))
    matches = dict(f.group().split(':', 1) for f in re.finditer(regex, result))
    return tuple(matches[v][:-1] for v in vars)

type, PT, APTT, datetime = from_result('type', 'PT', 'APTT', 'date')
date, time = datetime.split()
Notice that this can be easily extended in the event you suddenly become interested in some other 'var' in the string...
In short you can optimize this further (to avoid the split step) by capturing groups in the regex search...
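A hedged sketch of that capture-group idea (the function name and the shortened sample string below are just for illustration): the field name and its value are captured as separate groups, so there is no post-split and no trailing ';' to strip.
import re

def from_result(result, *vars):
    names = '|'.join(re.escape(var) for var in vars)
    # one group for the field name, one for its value, terminated by the ';'
    pattern = re.compile(rf'({names}):(.*?);')
    found = dict(pattern.findall(result))
    return tuple(found[v] for v in vars)

result = ("result{type:PT/APTT;error:0;PT:32.3 s;INR:0.0;APTT:119.2;"
          "date:20/01/2017 06:47;SN:024000G0900046;version:V2.8.0.09}")
type_, PT, APTT, date_time = from_result(result, 'type', 'PT', 'APTT', 'date')
date, time = date_time.split()
print(type_, PT, APTT, date, time)  # PT/APTT 32.3 s 119.2 20/01/2017 06:47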
I need to read in a file which has multiple data lines as follows:
1 D 65.33383 BAZ 308.1043 Year 2001 Month 01 Day 01 Lat 6.90 Long 126.58 Mag 6.4 Origin Time 06:57:04.2
I need to split the file into lines, which I have done, then split each line into variables at each space.
So far I am using a nested loop that looks like:
for line in open("filename", 'r').readlines():
variable = string.split(line)
values = [variable]
for value in values
value = string.split(' ')
year, month = value[0], value [1]
My problem is that I don't know what the parts in the second for loop need to be, i.e. for ... in ...
I am quite new to programming in python.
I am not fully sure what exactly you are trying to accomplish; one thing that is especially unclear is your expression "then split each line into variables at each space".
But assuming you need output consisting of a list of dictionaries, each containing the parsed data from one line, the following should be useful for you:
data = []
with open("file.txt") as f:
    for line in f:
        lineData = {}
        lineSplit = line.split()
        # pair up neighbouring tokens as key/value, skipping the leading index
        for i in range(1, len(lineSplit) - 1, 2):
            lineData[lineSplit[i]] = lineSplit[i + 1]
        data.append(lineData)
print(data)
This will get you the output which will look like this:
[{'Origin': 'Time', 'D': '65.33383', 'BAZ': '308.1043', 'Long': '126.58', 'Month': '01', 'Mag': '6.4', 'Year': '2001', 'Lat': '6.90', 'Day': '01'}]
The dictionary is unsorted, so keys and values appear in random order. Notice that 'Origin' and 'Time' became a key and a value because you wanted to split the line on spaces and there is a space between 'Origin' and 'Time'. Cheers!
In this case, using a regular expression is probably the easiest, since some of your entries contain spaces.
The following expression finds anything that is not digits, followed by something that consists only of digits, dots and colons:
import re
key_val = re.compile(r'\s*([^\d]+)\s+([\d.:]+)\s*')
mapping = dict(key_val.findall(line))
This produces a dictionary object:
>>> import re
>>> line = '1 D 65.33383 BAZ 308.1043 Year 2001 Month 01 Day 01 Lat 6.90 Long 126.58 Mag 6.4 Origin Time 06:57:04.2\n'
>>> key_val = re.compile(r'\s*([^\d]+)\s+([\d.:]+)\s*')
>>> key_val.findall(line)
[('D', '65.33383'), ('BAZ', '308.1043'), ('Year', '2001'), ('Month', '01'), ('Day', '01'), ('Lat', '6.90'), ('Long', '126.58'), ('Mag', '6.4'), ('Origin Time', '06:57:04.2')]
>>> dict(key_val.findall(line))
{'D': '65.33383', 'BAZ': '308.1043', 'Long': '126.58', 'Month': '01', 'Origin Time': '06:57:04.2', 'Mag': '6.4', 'Year': '2001', 'Lat': '6.90', 'Day': '01'}
with open('data.txt', 'r') as data:
    for _input in data:
        line = _input.split(' ')
        # first token is the index, last token is the origin time
        data = {'Index': line[0],
                'Origin Time': line[-3:][-1].strip()
                }
        # pair up the remaining tokens as key/value (skipping 'Origin' and 'Time')
        data.update(dict(zip(line[1:-3][0::2], line[1:-3][1::2])))
        print(data)
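For the sample line from the question, this prints something like (key order may vary):
{'Index': '1', 'Origin Time': '06:57:04.2', 'D': '65.33383', 'BAZ': '308.1043', 'Year': '2001', 'Month': '01', 'Day': '01', 'Lat': '6.90', 'Long': '126.58', 'Mag': '6.4'}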