Following up on my previous question: I have a list of records, shown below, taken from this table:
itemImage    name    nameFontSize  nameW  nameH  conutry  countryFont  countryW  countryH  code  codeFontSize  codeW  codeH
sample.jpg   Apple   142           1200   200    US       132          1200      400       1564  82            1300   600
sample2.jpg  Orange  142           1200   200    UK       132          1200      400       1562  82            1300   600
sample3.jpg  Lemon   142           1200   200    FR       132          1200      400       1563  82            1300   600
Right now I have one function, setText, which takes all the elements of a row from this table.
I only have name, country and code for now, but I will be adding more fields in the future.
I want to make this code more future-proof and dynamic. For example, if I added four new columns to my data following the same pattern, how do I make Python adjust to that automatically, instead of me declaring new variables in my code every time?
Basically, I want to send each group of four columns, starting from name, to a function, and continue until no columns are left. Once that's done, move to the next row and continue the loop.
Thanks to @Samwise, who helped me clean up the code a bit.
import os
from PIL import Image, ImageFont, ImageDraw, features
import pandas as pd

path = './'
files = []
for (dirpath, dirnames, filenames) in os.walk(path):
    files.extend(filenames)

df = pd.read_excel(r'./data.xlsx')
records = list(df.to_records(index=False))

def setText(itemImage, name, nameFontSize, nameW, nameH,
            conutry, countryFontSize, countryW, countryH,
            code, codeFontSize, codeW, codeH):
    font1 = ImageFont.truetype(r'./font.ttf', nameFontSize)
    font2 = ImageFont.truetype(r'./font.ttf', countryFontSize)
    font3 = ImageFont.truetype(r'./font.ttf', codeFontSize)
    file = Image.open(f"./{itemImage}")
    draw = ImageDraw.Draw(file)
    draw.text((nameW, nameH), name, font=font1, fill='#ff0000',
              align="right", anchor="rm")
    draw.text((countryW, countryH), conutry, font=font2, fill='#ff0000',
              align="right", anchor="rm")
    draw.text((codeW, codeH), str(code), font=font3, fill='#ff0000',
              align="right", anchor="rm")
    file.save(f'done {itemImage}')

for i in records:
    setText(*i)
Sounds like df.columns might help. It returns an Index of the column labels, which you can iterate through to see whatever cols are present.
for col in df.columns:
The answers in this thread should help dial you in:
How to iterate over columns of pandas dataframe to run regression
It sounds like you also want row-wise results, so you could nest within df.iterrows or vice versa...though going cell by cell is generally not desirable and could end up being quite slow as your df grows.
So perhaps think about how you could use your function with df.apply().
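For example, here is a minimal sketch of one way to keep the row handling dynamic (iterating rows rather than using apply). It assumes the columns always follow the pattern itemImage, then repeated groups of value, font size, width, height, and drawField is a hypothetical helper I've made up for illustration; it relies on the imports and df from your code above:

def drawField(draw, text, fontSize, w, h):
    # Hypothetical helper: draws one value at (w, h) with its own font size.
    font = ImageFont.truetype(r'./font.ttf', int(fontSize))
    draw.text((w, h), str(text), font=font, fill='#ff0000',
              align="right", anchor="rm")

for row in df.itertuples(index=False):
    itemImage, *rest = row
    file = Image.open(f"./{itemImage}")
    draw = ImageDraw.Draw(file)
    # Walk the remaining columns four at a time: value, font size, width, height.
    for i in range(0, len(rest), 4):
        text, fontSize, w, h = rest[i:i + 4]
        drawField(draw, text, fontSize, w, h)
    file.save(f'done {itemImage}')

With this layout, adding four more columns that follow the same pattern should be picked up automatically, with no new variables to declare.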
I have two csv files:
old file:
name size_bytes
air unknown
data/air/monitor
data/air/monitor/ambient-air-quality-oil-sands-region
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/EN
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/EN/datapackage.json 886
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/EN/digest.txt 186
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/EN/JOSM_AMS13_SpecHg_AB_2017-04-02_EN.pdf 9033
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/FR
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/FR/datapackage.json 886
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/FR/digest.txt 186
...
new file:
name size_bytes
data 0
data/air 0
data/air/monitor 0
data/air/monitor/ambient-air-quality-oil-sands-region 0
data/air/monitor/ambient-air-quality-oil-sands-region/96c679c3-709e-4a42-89c6-09f09f2b7ffe.xml 65589
data/air/monitor/ambient-air-quality-oil-sands-region/datapackage.json 13152367
data/air/monitor/ambient-air-quality-oil-sands-region/digest.txt 188
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region 0
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02 0
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/FR 0
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/FR/JOSM_AMS13_SpecHg_AB_2017-04-02_FR.pdf 9186
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/digest.txt 82
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-09 0
...
I want to compare the names from the "old file" to the names in the "new file" and get any missing names (folder or file paths).
Right now I have this:
with open('old_file.csv', 'r') as old_file:
    old = set(row.split(',')[0].strip().lower() for row in old_file)

with open('new_file.csv', 'r') as new_file, open('compare.csv', 'w') as compare_files:
    for line in new_file:
        if line.split(',')[0].strip().lower() not in old:
            compare_files.write(line)
This runs but the output is not correct, it prints out names that ARE in both files.
Here is the output:
data 0
data/air 0
data/air/monitor/deposition-oil-sands-region/the-monitored-ambient-concentration-and-estimated-atmospheric-deposition-of-trace-elements-at-four-monitoring-sites-in-the-canadian-athabasca-oil-sands-region 0
data/air/monitor/deposition-oil-sands-region/the-monitored-ambient-concentration-and-estimated-atmospheric-deposition-of-trace-elements-at-four-monitoring-sites-in-the-canadian-athabasca-oil-sands-region/ElementConcentrationPM25_OSM_AMS-sites_2016-2017.csv 736737
data/air/monitor/deposition-oil-sands-region/the-monitored-ambient-concentration-and-estimated-atmospheric-deposition-of-trace-elements-at-four-monitoring-sites-in-the-canadian-athabasca-oil-sands-region/ElementConcentrationPM25to10_OSM_AMS-sites_2016-2017.csv 227513
data/air/monitor/deposition-oil-sands-region/the-monitored-ambient-concentration-and-estimated-atmospheric-deposition-of-trace-elements-at-four-monitoring-sites-in-the-canadian-athabasca-oil-sands-region/ElementFlux_OSM_AMS-sites_2016-2017.csv 691252
data/air/monitor/deposition-oil-sands-region/the-monitored-ambient-concentration-and-estimated-atmospheric-deposition-of-trace-elements-at-four-monitoring-sites-in-the-canadian-athabasca-oil-sands-region/ffeae500-ea0c-493f-9b24-5efbd16411fd.xml 41399
data/air/monitor/monitoring-of-atmospheric-precipitation-chemistry/major-ions/AtmosphericPrecipitationChemistry-MajorIons-APQMP-AllSites-2019.csv 169109
data/air/monitor/monitoring-of-atmospheric-precipitation-chemistry/major-ions/AtmosphericPrecipitationChemistry-MajorIons-APQMP-AllSites-2020.csv 150205
data/air/monitor/monitoring-of-atmospheric-precipitation-chemistry/major-ions/AtmosphericPrecipitationChemistry-MajorIons-CAPMoN-AllSites-2017.csv 4343972
data/air/monitor/monitoring-of-atmospheric-precipitation-chemistry/major-ions/AtmosphericPrecipitationChemistry-MajorIons-CAPMoN-AllSites-2018.csv 3782783
data/air/monitor/monitoring-of-combined-atmospheric-gases-and-particles/major-ions-and-acidifying-gases 0
data/air/monitor/monitoring-of-combined-atmospheric-gases-and-particles/major-ions-and-acidifying-gases/AtmosphericCombinedGasesParticles-FilterPack-CAPMoN-AllSites-2012.csv 1826690
data/air/monitor/monitoring-of-combined-atmospheric-gases-and-particles/major-ions-and-acidifying-gases/AtmosphericCombinedGasesParticles-FilterPack-CAPMoN-AllSites-2013.csv 1890761
data/air/monitor/monitoring-of-combined-atmospheric-gases-and-particles/major-ions-and-acidifying-gases/AtmosphericCombinedGasesParticles-FilterPack-CAPMoN-AllSites-2014.csv 1946788
data/air/monitor/monitoring-of-combined-atmospheric-gases-and-particles/major-ions-and-acidifying-gases/AtmosphericCombinedGasesParticles-FilterPack-CAPMoN-AllSites-2015.csv 2186536
data/air/monitor/monitoring-of-combined-atmospheric-gases-and-particles/major-ions-and-acidifying-gases/AtmosphericCombinedGasesParticles-FilterPack-CAPMoN-AllSites-2016.csv 2434692
data/air/monitor/monitoring-of-combined-atmospheric-gases-and-particles/major-ions-and-acidifying-gases/AtmosphericCombinedGasesParticles-FilterPack-CAPMoN-AllSites-2017.csv 2150499
data/air/monitor/monitoring-of-combined-atmospheric-gases-and-particles/major-ions-and-acidifying-gases/AtmosphericCombinedGasesParticles-FilterPack-CAPMoN-AllSites-2018.csv 2136853
...
Is there something wrong with my code?
Is there a better way to do this? Maybe using pandas?
Your tags mention Pandas but I don't see you using it. Either way, an outer merge should do what you want, if I understand your question:
import pandas as pd

old = pd.read_csv(path_to_old_file)
new = pd.read_csv(path_to_new_file)
df = pd.merge(old, new, on="name", how="outer")
Your post isn't super clear on what exactly you need, and I don't particularly feel like scrutinizing those file names for differences. From what I could gather, you want all the unique file paths from both csv files, right? It's not clear what you want done with the other column, so I've left it alone.
I recommend reading this Stack Overflow post.
EDIT
After your clarification:
import numpy as np

old = pd.read_csv(path_to_old_file)
new = pd.read_csv(path_to_new_file)
np.setdiff1d(old["name"], new["name"])
This will give you all the values in the name column of the old dataframe which are not present in the new dataframe.
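If you also want to write those missing names out to a file like your original compare.csv, here's a small follow-up sketch (assuming both files parse correctly with whatever separator you pass to read_csv):

missing = np.setdiff1d(old["name"], new["name"])
# Keep the full rows from the old file whose names are absent from the new file,
# then write them out for review.
old[old["name"].isin(missing)].to_csv("compare.csv", index=False)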
I'm having trouble parsing a txt file (see here: File)
Here's my code
import pandas as pd
objectname = r"path"
df = pd.read_csv(objectname, engine = 'python', sep='\t', header=None)
Unfortunately it does not work. Since this question has been asked several times, I tried lots of proposed solutions (most of them can be found here: Possible solutions)
However, nothing did the trick for me. For instance, when I use
sep='delimiter'
the dataframe is created, but everything ends up in a single column.
When I use
error_bad_lines=False
the rows I'm interested in are simply skipped.
The only way it works is when I first open the txt file, copy the content, paste it into Google Sheets, save the file as CSV and then load that into a dataframe.
I guess another workaround would be to use
df = pd.read_csv(objectname, engine='python', sep='delimiter', header=None)
in combination with the split function (see: Split function).
Are there any suggestions for how to make this work without needing to convert the file or use the split function? I'm using Python 3 and Windows 10.
Any help is appreciated.
Your file has tab separators but is not a TSV. The file is a mixture of metadata, followed by a "standard" TSV, followed by more metadata. Therefore, I found tackling the metadata as a separate task from loading the data to be useful.
Here's what I did to extract the metadata lines:
with open('example.txt', 'r') as file_handle:
    file_content = file_handle.read().split('\n')

for index, line in enumerate(file_content):
    if index < 21 or index > 37:
        print(index, line.split('\t'))
Note that the lines denoting the start and stop of metadata (21 and 37 in my example) are specific to the file. I've provided the trimmed data I used below (based on your linked file).
Separately, I loaded the TSV into Pandas using
import pandas as pd

df = pd.read_csv('example.txt', engine='python',
                 sep='\t', error_bad_lines=False, header=None,
                 skiprows=list(range(21)) + list(range(37, 89)))
Again, I skipped the metadata at the start of the file and at the end of the file.
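If hard-coding the line numbers feels brittle, one possible refinement (a sketch, assuming every file marks the data block with an 'XYDATA' line before it and a '##### Extended Information' line after it, as yours does) is to locate those boundaries first:

# Find the numeric block by its surrounding marker lines rather than fixed indices.
with open('example.txt', 'r') as file_handle:
    lines = file_handle.read().split('\n')

data_start = next(i for i, line in enumerate(lines) if line.startswith('XYDATA')) + 1
data_end = next(i for i, line in enumerate(lines) if line.startswith('#####'))

df = pd.read_csv('example.txt', engine='python', sep='\t', header=None,
                 skiprows=lambda i: i < data_start or i >= data_end)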
Here's the file I experimented with. I've trimmed the extra data to reduce line count.
TITLE Test123
DATA TYPE
ORIGIN JASCO
OWNER
DATE 19/03/28
TIME 16:39:44
SPECTROMETER/DATA SYSTEM
LOCALE 1031
RESOLUTION
DELTAX -0,5
XUNITS NANOMETERS
YUNITS CD [mdeg]
Y2UNITS HT [V]
Y3UNITS ABSORBANCE
FIRSTX 300,0000
LASTX 190,0000
NPOINTS 221
FIRSTY -0,78961
MAXY 37,26262
MINY -53,38971
XYDATA
300,0000 -0,789606 182,198 -0,0205245
299,5000 -0,691644 182,461 -0,0181217
299,0000 -0,700976 182,801 -0,0136756
298,5000 -0,614708 182,799 -0,0131957
298,0000 -0,422611 182,783 -0,0130073
195,0000 26,6231 997,498 4,7258
194,5000 -17,3049 997,574 4,6864
194,0000 16,0387 997,765 4,63967
193,5000 -14,4049 997,967 4,58593
193,0000 -0,277261 998,025 4,52411
192,5000 -29,6098 998,047 4,45244
192,0000 -11,5786 998,097 4,36608
191,5000 34,0505 998,282 4,27376
191,0000 28,2325 998,314 4,1701
190,5000 -13,232 998,336 4,05036
190,0000 -47,023 998,419 3,91883
##### Extended Information
[Comments]
Sample name X
Comment
User
Division
Company RWTH Aachen
[Detailed Information]
Creation date 28.03.2019 16:39
Data array type Linear data array * 3
Horizontal axis Wavelength [nm]
Vertical axis(1) CD [mdeg]
Vertical axis(2) HT [V]
Vertical axis(3) Abs
Start 300 nm
End 190 nm
Data interval 0,5 nm
Data points 221
[Measurement Information]
Instrument name CD-Photometer
Model name J-1100
Serial No. A001361635
Detector Standard PMT
Lock-in amp. X mode
HT volt Auto
Accessory PTC-514
Accessory S/N A000161648
Temperature 18.63 C
Control sonsor Holder
Monitor sensor Holder
Measurement date 28.03.2019 16:39
Overload detect 203
Photometric mode CD, HT, Abs
Measure range 300 - 190 nm
Data pitch 0.5 nm
CD scale 2000 mdeg/1.0 dOD
FL scale 200 mdeg/1.0 dOD
D.I.T. 0.5 sec
Bandwidth 1.00 nm
Start mode Immediately
Scanning speed 200 nm/min
Baseline correction Baseline
Shutter control Auto
Accumulations 3
I have what I guess is a moderately sized dataframe of ~500k rows and 200 columns, with 8GB of memory.
My problem is that when I go to slice my data, even when it is trimmed down to a very small dataset of 6k rows and 200 columns, it just hangs and hangs for 10-15 min+. Then, if I hit the STOP button in Python Interactive and re-try, the process completes in 2-3 seconds.
I don't understand why the row-slicing can't just run in those 2-3 seconds in the first place. It is making it impossible to run programs, as things just hang and hang and have to be manually stopped before they work.
I am following the approach laid out on the h2o webpage:
import h2o
h2o.init()
# Import the iris with headers dataset
path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv"
df = h2o.import_file(path=path)
# Slice 1 row by index
c1 = df[15,:]
c1.describe
# Slice a range of rows
c1_1 = df[range(25,50,1),:]
c1_1.describe
# Slice using a boolean mask. The output dataset will include rows with a sepal length
# less than 4.6.
mask = df["sepal_len"] < 4.6
cols = df[mask,:]
cols.describe
# Filter out rows that contain missing values in a column. Note the use of '~' to
# perform a logical not.
mask = df["sepal_len"].isna()
cols = df[~mask,:]
cols.describe
The error message from the console is as follows; I have this same error message repeated several times:
/opt/anaconda3/lib/python3.7/site-packages/h2o/expr.py in <listcomp>(.0)
149 return self._cache._id # Data already computed under ID, but not cached
150 assert isinstance(self._children,tuple)
--> 151 exec_str = "({} {})".format(self._op, " ".join([ExprNode._arg_to_expr(ast) for ast in self._children]))
152 gc_ref_cnt = len(gc.get_referrers(self))
153 if top or gc_ref_cnt >= ExprNode.MAGIC_REF_COUNT:
~/opt/anaconda3/lib/python3.7/site-packages/h2o/expr.py in _arg_to_expr(arg)
161 return "[]" # empty list
162 if isinstance(arg, ExprNode):
--> 163 return arg._get_ast_str(False)
164 if isinstance(arg, ASTId):
I have an issue where I can't understand why the value I'm searching for doesn't appear to exist in the dataframe.
I started by combining a couple of numerical columns in my original dataframe and assigning the result to a new column. Then I extracted a list of unique values from that combined column:
~snip
self.df['Flitch'] = self.df['Blast'].map(
    str) + "-" + self.df['Bench'].map(str)

self.flitches = self.df['Flitch'].unique()
~snip
Now, slightly further on in the code, I need to get the earliest date values corresponding to these unique identifiers. So I run a query on the dataframe:
~snip
def get_dates(self):
    '''Extracts mining and loading dates from filtered dataframe'''
    loading_date, mining_date = [], []
    # loop through all unique flitch ids and get their mining
    # and loading dates
    for flitch in self.flitches:
        temp = self.df.query('Activity=="MINE"')
        temp = temp.query(f'Flitch=={flitch}')
        mining = temp['PeriodStartDate'].min()
        mining_date.append(mining)
~snip
...and I get nothing. I can't understand why: I'm comparing data extracted from the column to that very same column and not getting any matches.
I've gone and manually checked that the list of unique ids is populated correctly.
I've checked that the dataframe I'm running the query on does indeed have those same flitch ids.
I've manually checked several random values from the self.flitches list, and the comparison comes back as False every time.
Before I combined those two columns, when I used only 'Blast' as the identifier, everything worked perfectly, but now I'm not sure what is happening.
Here, for example, I've printed the self.flitches list:
['5252-528' '5251-528' '3030-492' '8235-516' '7252-488' '7251-488'
'2351-588' '5436-588' '1130-624' '5233-468' '1790-516' '6301-552'
'6302-552' '5444-576' '2377-564' '2380-552' '2375-564' '5253-528'
'2040-468' '2378-564' '1132-624' '1131-624' '6314-540' '7254-488'
'7253-488' '8141-480' '7250-488']
And here is data from self.df['Flitch'] column:
173 5252-528
174 5251-528
175 5251-528
176 5251-528
177 5251-528
178 5251-528
180 3030-492
181 3030-492
182 3030-492
183 3030-492
...
It looks like they have to match but they don't...
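One thing worth checking (an observation on the snippet above, not something I've verified against your data): the f-string interpolates the flitch id without quotes, so pandas parses Flitch==5252-528 as the number 5252 minus 528 rather than as a string comparison. A minimal sketch of the difference, using df in place of self.df:

flitch = '5252-528'

df.query(f'Flitch=={flitch}')     # parsed as Flitch == 4724, so nothing matches
df.query(f'Flitch=="{flitch}"')   # quoted: compares against the string '5252-528'
df.query('Flitch == @flitch')     # or reference the local variable directly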
TLDR: The df.query() tool doesn't seem to work if the df's columns are tuples or even tuples converted into strings. How can I work around this to get the slice I'm aiming for?
Long Version: I have a pandas dataframe that looks like this (although there are a lot more columns and rows...):
> dosage_df
Score  ("A_dose","Super")  ("A_dose","Light")  ("B_dose","Regular")
28     1                   40                  130
11     2                   40                  130
72     3                   40                  130
67     1                   90                  130
74     2                   90                  130
89     3                   90                  130
43     1                   40                  700
61     2                   40                  700
5      3                   40                  700
Along with my dataframe, I also have a Python dictionary with the relevant ranges for each feature. The keys are the feature names, and the values are the different levels each feature can take:
# Original Version
dosage_df.columns = ['First Score', 'Last Score', ("A_dose","Super"), ("A_dose","Light"), ("B_dose","Regular")]

dict_of_dose_ranges = {("A_dose","Super"): [1, 2, 3],
                       ("A_dose","Light"): [40, 70, 90],
                       ("B_dose","Regular"): [130, 200, 500, 700]}
For my purposes, I need to generate a particular combination (say ("A_dose","Super") = 1, ("A_dose","Light") = 90, and ("B_dose","Regular") = 700), and based on those settings take the relevant slice out of my dataframe, do the relevant calculations on that smaller subset, and save the results somewhere.
I'm doing this by implementing the following:
from itertools import product

for dosage_comb in product(*dict_of_dose_ranges.values()):
    dosage_items = zip(dict_of_dose_ranges.keys(), dosage_comb)
    query_str = ' & '.join('{} == {}'.format(*x) for x in dosage_items)
    sub_df = dosage_df.query(query_str)
    ...
The problem is that it gets hung up on the query step, as it returns the following error message:
TypeError: argument of type 'int' is not iterable
In this case, the query generated looks like this:
query_str = '("A_dose","Light") == 40 & ("A_dose","Super") == 1 & ("B_dose","Regular") == 130'
Troubleshooting Attempts:
I've confirmed that this approach does work for a dataframe with plain string columns, as found here. In addition, I've also tried "tricking" the tool by converting the columns and the dictionary keys into strings with the following code... but that returned the same error.
# String Version
dosage_df.columns = ['First Score', 'Last Score', '("A_dose","Super")', '("A_dose","Light")', '("B_dose","Regular")']

dict_of_dose_ranges = {
    '("A_dose","Super")': [1, 2, 3],
    '("A_dose","Light")': [40, 70, 90],
    '("B_dose","Regular")': [130, 200, 500, 700]}
Is there an alternate tool in python that can take tuples as inputs or a different way for me to trick it into working?
You can build a list of conditions and logically condense them with np.all instead of using query:
import numpy as np

for dosage_comb in product(*dict_of_dose_ranges.values()):
    dosage_items = zip(dict_of_dose_ranges.keys(), dosage_comb)
    condition = np.all([dosage_df[col] == dose for col, dose in dosage_items], axis=0)
    sub_df = dosage_df[condition]
This method seems to be a bit more flexible than query, but when filtering across many columns I've found that query often performs better. I don't know if this is true in general though.