While compiling a pandas table to plot certain activity on a tool, I encountered a rare error in the data that creates an extra two columns for certain entries. This means that one of my computed columns goes into the table two cells further on than the others, which kills the plot.
I was hoping to find a way to pull the contents of a single cell in a row and swap it into the cell beside it, which contains irrelevant information in the error case but is used for the plot of all the other pd data.
I've tried a couple of different ways to swap the data around but keep hitting errors.
My attempts to fix it include:
for rows in df['server']:
    if '%USERID' in line:
        df['server'] = df[7]  # both versions of this and below
        df['server'].replace(df['server'], df[7])
    else:
        pass

if '%USERID' in df['server']:  # Attempt to fix missing server name
    df['server'] = df[7]
else:
    pass

if '%USERID' in df['server']:
    return row['7'], row['server']
else:
    pass
I'd like the data from column 7 to be replicated in 'server', but only in the error case, where the data in the cell contains a string starting with '%USERID'.
Turns out I was over-thinking this one. I took a step back, reworked the code a bit and solved it.
Rather than trying to force a one-size-fits-all bit of code onto all the data, I built separate lists for the general data and the two exceptions I found by writing a nested loop, and created three data frames. These were easy enough to then manipulate individually and finally concatenate together. All working fine now.
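For anyone finding this later, a masked assignment might have done the swap in one step. A minimal sketch, assuming the extra column really is labeled 7 and the bad 'server' cells start with '%USERID' (the toy frame is made up):

import pandas as pd

# Toy frame standing in for the real data; the column label 7 is an assumption.
df = pd.DataFrame({'server': ['web01', '%USERID_foo'], 7: ['junk', 'web02']})

mask = df['server'].astype(str).str.startswith('%USERID')  # error rows only
df.loc[mask, 'server'] = df.loc[mask, 7]                   # copy column 7 into 'server'
print(df)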
I'm pretty new to coding and have a problem resampling my dataframe with Pandas. I need to resample my data ("value") to means for every 10 minutes (13:30, 13:40, etc.). The problem is: The data start around 13:36 and I can't access them by hand because I need to do this for 143 dataframes. Resampling adds the mean at the respective index (e.g. 13:40 for the second value), but because 13:30 is not part of my indices, that value gets lost.
I'm trying two different approaches here: First, I tried every option of resample() (offset, origin, convention, ...). Then I tried adding the missing values manually with a loop, which doesn't run properly because I didn't know how to access the correct spot on the list. The list does include all relevant values though. I also tried adding a row with 13:30 as the index on top of the dataframe but didn't manage to convince Python that my index is legit because it's a timestamp (this is not in the code).
Sorry for the very rough code, it just didn't work in several places which is why I'm asking here.
If you have a possible solution, please keep in mind that it has to function within an already long loop because of the many dataframes I have to work on simultaneously.
Thank you very much!
df["tenminavg"] = df["value"].resample("10Min").mean()
df["tenminavg"] = df["tenminavg"].ffill()
ls1 = df["value"].resample("10Min").mean() #my alternative: list the resampled values in order to eventually access the first relevant timespan
for i in df.index:  # this loop doesn't work. It should add the value for the first 10 min
    if df["tenminavg"][i] == "nan":
        if datetime.time(13, 30) <= df.index.time < datetime.time(13, 40):
            df["tenminavg"][i] = ls1.index.loc[i.floor("10Min")]["value"]  # tried to access the corresponding data point in the list
        else:
            continue
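For what it's worth, one way around the index mismatch would be to look up each row's own 10-minute bin instead of assigning the resampled series directly. A sketch, assuming df has a DatetimeIndex and a "value" column:

tenmin = df["value"].resample("10Min").mean()  # bins at 13:30, 13:40, ...
# map every row onto the mean of its own bin; .to_numpy() sidesteps index alignment
df["tenminavg"] = tenmin.reindex(df.index.floor("10Min")).to_numpy()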
I'm a Pandas newbie, so please bear with me.
Overview: I started with a free-form text file created by a data harvesting script that remotely accessed dozens of different kinds of devices, and multiple instances of each. I used OpenRefine (a truly wonderful tool) to munge that into a CSV that was then input to dataframe df using Pandas in a JupyterLab notebook.
My first inspection of the data showed the 'Timestamp' column was not monotonic. I accessed individual data sources as follows, in this case for the 'T-meter' data source. (The technique was taken from a search result - I don't really understand it, but it worked.)
cond = df['Source'] == 'T-meter'               # boolean mask: True for T-meter rows
rows = df.loc[cond, :]                         # select only those rows
df_tmeter = pd.DataFrame(columns=df.columns)   # empty frame with the same columns
df_tmeter = df_tmeter.append(rows, ignore_index=True)
then checked each as follows:
df_tmeter['Timestamp'].is_monotonic
Fortunately, the problem was easy to identify and fix: Some sensors were resetting, then sending bad (but still monotonic) timestamps until their clocks were updated. I wrote the function healing() to cleanly patch such errors, and it worked a treat:
df_tmeter['healed'] = df_tmeter['Timestamp'].apply(healing)
Now for my questions:
How do I get the 'healed' values back into the original df['Timestamp'] column for only the 'T-meter' items in df['Source']?
Given the function healing(), is there a clean way to do this directly on df?
Thanks!
Edit: I first thought I should be using 'views' into df, but other operations on the data would either generate errors, or silently turn the views into copies.
I wrote a wrapper function heal_row() for healing():
def heal_row(row):
    if row['Source'] == 'T-meter':  # Redundant check, but safe!
        row['Timestamp'] = healing(row['Timestamp'])
    return row
then did the following:
df = df.apply(lambda row: row if row['Source'] != 'T-meter' else heal_row(row), axis=1)
This ordering is important, since healing() is stateful based on the prior row(s), and thus can't be the default operation.
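In hindsight, a masked assignment might do the same thing in one step, since .apply preserves row order within the selection. A sketch, untested against the stateful healing():

mask = df['Source'] == 'T-meter'
df.loc[mask, 'Timestamp'] = df.loc[mask, 'Timestamp'].apply(healing)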
I have a few (15) data frames. They contain values based on one map, but in fragmentary form.
The list of samples looks like: A1 - 3k records, A2 - 6k records, B1 - 12k records, B2 - 1k records, B3 - 3k records, C1... etc.
All files have the same format, which looks like this:
name sample position position_ID
String1 String1 num1 num1
String2 String2 num2 num2
...
All files come from a variety of biological microarrays. Different companies have different matrices, hence the scatter in the size of files. But each of them is based on one common, whole database. Just some of the data from the main database is selected. Therefore, individual records can be repeated between files. I want to see if they are compatible.
What do I want to achieve in this task?
I want to check that all records with the same name have the same position and position_ID values in all files.
If a record with the same name differs in values in any file, it must be written to error.csv.
If it is the same everywhere - result.csv.
To be honest, I don't know how to approach this, so I'm asking here in the hope of good advice. I want to do it in Python.
I have two ideas.
Load all files into Pandas as one data frame and try to write a function that filters the whole DF record by record (a for loop with if statements?).
Open all files separately with plain Python file reading, adding unique rows to a new list; when the read function encounters the same recordName again, it checks it against the previous one. If all the remaining values are the same it passes without writing; if not, the record is written to error.csv.
I am afraid, however, that these may not be the most optimal methods, hence I'm asking for advice and for pointers to something better. I have read about numpy; I have not studied it yet, but maybe it is worth using in the context of this task? Maybe there is a function that already exists for this and I do not know about it?
Can someone suggest a more sensible (maybe easier) solution?
I think I have a rough idea of where you are going. This is how I would approach it:
import pandas as pd
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
df1["filename"] ="file1.csv"
df2["filename"] ="file2.csv"
df_total = pd.concat([df1, df2], axis=0) # stacks them vertically
df_total_no_dupes = df_total.drop_duplicates(subset=["name", "sample", "position", "position_ID"]) # drops records that are identical across files, ignoring the filename column
# this gives you the cases where a name occurs more than once
name_counts = df_total_no_dupes.groupby("name").size().reset_index(name='counts')
names_which_appear_more_than_once = name_counts[name_counts["counts"] > 1]["name"].unique()
filter_condition = df_total_no_dupes["name"].isin(names_which_appear_more_than_once)
# this should be your dataframe where there are at least two rows with same name but different values.
print(df_total_no_dupes[filter_condition].sort_values("name"))
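If you then want the error.csv / result.csv split from the question, a possible follow-up:

# names with conflicting values go to error.csv, the rest to result.csv
bad = df_total_no_dupes[filter_condition]
good = df_total_no_dupes[~filter_condition]
bad.sort_values("name").to_csv("error.csv", index=False)
good.sort_values("name").to_csv("result.csv", index=False)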
I have been using pandas but am open to all suggestions; I'm not an expert at scripting and am at a complete loss. My goal is the following:
Merge multiple CSV files. Was able to do this in Pandas and have a dataframe with the merged dataset.
[Screenshot of the merged dataset]
Delete the duplicated "GEO" columns after the first set. This part doesn't let me use df = df.loc[:,~df.columns.duplicated()] because they are not technically duplicated: the repeated column names end with a .1, .2, etc., which I am guessing concat adds. The other problem is that some columns have a duplicated column name but are different datasets. I have been using the first row as the index since it's always the same coded values, but this row is unnecessary and will be deleted later in the script. This is my biggest problem right now (see the sketch at the end of this question).
Delete certain columns, such as the ones with "Margins". I use ~df2.columns.str.startswith for this and have no trouble with it.
Replace spaces, ":" and ";" with underscores in the first row. I have no clue how to do this (also covered in the sketch at the end).
Insert a new column, write a '=TEXT(B1,0)' formula, fill it down the whole column (the formula changing to B2, B3, etc.), then copy the column and paste it as values. I was able to do this in openpyxl, although I was having trouble and could not verify the final output thanks to Excel trouble.
source = excel.Workbooks.Open(filename)
excel.Range("C1:C1337").Select()
excel.Selection.Copy()
excel.Selection.PasteSpecial(Paste=constants.xlPasteValues)
Not sure if it works, and I was wondering whether this is possible in pandas or win32com, or whether I should stay with openpyxl. Thanks all!
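A pandas sketch of steps 2, 4 and 5 might look like this (df is the merged frame from step 1; the GEO prefix, the .1/.2 suffix pattern, and the column name "B" are assumptions):

import pandas as pd

# Step 2 sketch: drop repeated GEO columns, ignoring the .1/.2 suffixes
# that pandas appends to duplicate names when concatenating.
base = df.columns.str.replace(r"\.\d+$", "", regex=True)
df = df.loc[:, ~(base.str.startswith("GEO") & base.duplicated())]

# Step 4 sketch: replace spaces, ':' and ';' with underscores in the first row.
df.iloc[0] = df.iloc[0].astype(str).str.replace(r"[ :;]", "_", regex=True)

# Step 5 sketch: the pandas equivalent of the =TEXT(B1,0) column is a string cast.
df["B_as_text"] = df["B"].astype(str)  # "B" is a placeholder column name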
I get an Excel file from someone and I need to read the data every month. The format is not stable each time, and by "not stable" I mean:
Where the data starts changes: e.g. Section A may start on row 4, column D this time, but next time it may start at row 2, column E.
Under each section there are tags. The number of tags may change as well. But every time I only need the data in tag_2 and tag_3 (these two will always show up)
The only data that I need is from tag_2 and tag_3, for each month (month1 - month8). I want to find a way, using Python, to first locate the section name, then find tag_2 and tag_3 under that section, then get the data for month1 to month8 (the number of months may change as well).
Please note that I do NOT want to locate the data by specifying locations in Excel, since the locations change every time. How do I do this?
The end product should be a pandas dataframe that has monthly data for tag_2, tag_3, with a column that says which section the data come from.
Thanks.
I think you can read it directly as a comma-separated text file. Based on what you need, you can look at tag_2 and tag_3 on each line.
with open(filename, "r") as fs:
    for line in fs:
        cell_list = line.split(",")
        # At this point you have all elements on the line as a list;
        # you can check its size and implement your logic.
Assuming that the (presumably manually pasted) block of information is unlikely to end up in the very bottom-right corner of the Excel sheet, you could simply iterate over rows and columns (set maximum values for each to prevent long search times) until you find a familiar value (such as "Section A") and go from there.
Unless I misunderstood you, the rest of the format should be consistent between the months, so you can simply assume that "month_1" is always one cell up and two to the right of that initial spot.
I have not personally worked with Excel sheets in Python, so I cannot state whether the following is possible in Python, but it definitely works in Excel VBA:
You could just as well use the Range.Find() method to find the value "Section A" and continue with the same process as above, perhaps writing any results to a txt file and calling your Python script from there if necessary.
I hope this helps a little.
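For reference, an untested sketch of what that search might look like in Python with openpyxl (the library choice, the file name, and the search window are all assumptions):

from openpyxl import load_workbook

ws = load_workbook("report.xlsx").active  # "report.xlsx" is a placeholder name
anchor = None
for row in ws.iter_rows(min_row=1, max_row=100, max_col=50):  # capped search window
    for cell in row:
        if cell.value == "Section A":
            anchor = (cell.row, cell.column)
            break
    if anchor:
        break
# With the anchor found, fixed offsets such as (row - 1, column + 2)
# reach "month_1", following the layout described above.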