I have a dataset whose columns I cannot reference. In the screenshot below, I have marked in yellow what I need to be recognized as columns (Vale On, Petroleo, etc.), plus the Date column, which I need recognized as dates since I am working with time-series data.
I have tried resetting the index and some related solutions, but nothing worked. I am new to Python, so I am sorry if it is too obvious.
import pandas as pd

# use the first row as the column names
df.columns = df.iloc[0]
# and then drop that row
df = df.iloc[1:]
# convert the Date column to datetime
# if it doesn't work, try passing format=... -- see https://strftime.org/
# and https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
df['Date'] = pd.to_datetime(df['Date'])
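Alternatively, if the frame comes straight from a file, you can often avoid the manual fix by telling pandas which row holds the headers. A minimal sketch, assuming a CSV named prices.csv (the filename is a placeholder; adjust header or skiprows to wherever your real header row sits):

import pandas as pd

# header=0 takes the first file row as column names;
# parse_dates converts the Date column while reading
df = pd.read_csv('prices.csv', header=0, parse_dates=['Date'])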
A debugging hint if the date parsing keeps failing: check whether your date strings are consistent, perhaps like so: df['Date'].str.len().value_counts(). That should return only one length. If it returns multiple rows, you have inconsistent, anomalous data that you will have to clean first.
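If that shows several lengths, a hedged way to locate the offending rows (the format string here is only an example, match it to your data):

import pandas as pd

# errors='coerce' turns unparseable strings into NaT instead of raising;
# '%Y-%m-%d' is just an example format
parsed = pd.to_datetime(df['Date'], format='%Y-%m-%d', errors='coerce')

# inspect the raw strings that failed to parse
print(df.loc[parsed.isna(), 'Date'])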
I'm using pandas to load short_desc.csv, which has the columns ["report_id", "when", "what"], with:
# read the csv
shortDesc = pd.read_csv('short_desc.csv')
# keep only numerical, non-null report_ids (or so I thought)
shortDesc = shortDesc[shortDesc['report_id'].str.isdigit().notnull()]
# convert 'when' from UNIX timestamp to datetime
shortDesc['when'] = pd.to_datetime(shortDesc['when'], unit='s')
which results in the following:
I'm trying to remove rows that have duplicate 'report_id's by sorting by date and keeping, for each 'report_id', the row with the newest date:
shortDesc = shortDesc.sort_values(by='when').drop_duplicates(['report_id'], keep='last')
The problem is that when I use .sort_values() on this particular DataFrame, the values of 'what' come out scattered across all columns and the 'report_id' values disappear:
shortDesc = shortDesc.sort_values(by=['when'], inplace=False)
I'm not sure why this is happening in this particular instance, since I was able to achieve the correct results with another DataFrame of the same shape using the same code (P.S. it's not a mistake; I dropped the 'what' column in the second picture):
[screenshot: a DataFrame of similar shape]
[screenshot: the desired result with the similar-shape DataFrame]
I found out that:
#get all numerical and nonnull values
shortDesc = shortDesc[shortDesc['report_id'].str.isdigit().notnull()]
was only checking whether a value was not null: .str.isdigit() returns NaN for null entries, so chaining .notnull() onto it only asks "did isdigit produce a result?" rather than "is the string all digits?". That caused the "report_id" field to keep non-numeric values. I changed this to two separate lines:
shortDesc = shortDesc[shortDesc['report_id'].notnull()]
shortDesc = shortDesc[shortDesc['report_id'].str.isnumeric()]
which allowed
shortDesc.sort_values(by='when', inplace=True)
to work as intended. I am still confused as to why .sort_values(by='when') was affected by the 'report_id' column, so if anyone knows, please enlighten me.
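A small standalone demo makes the failure mode visible:

import pandas as pd
import numpy as np

s = pd.Series(['123', 'abc', np.nan])

print(s.str.isdigit())
# 0     True
# 1    False
# 2      NaN

print(s.str.isdigit().notnull())
# 0     True   <- all digits
# 1     True   <- not digits, yet it passes: the isdigit result is merely non-null
# 2    False   <- only the actual NaN is filtered out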
I have the following DataFrame
Date A AAPL FB GOOG MSFT WISE.L
2021-10-15 153.270004 144.839996 324.76001 2833.5 304.209991 900.0
I am trying to write code that will check whether any of df.columns is a string ending in ".L", and then change that column's value. For example: in the df above I want to reach 900.0 and change it.
Note: there can be numerous strings containing ".L", with different names, all depending on user input, so I'll need to fetch all of them and change them at once.
Is it possible to do it or I should find out a different way to do it?
-- Editing my question after @Kosmos's suggestion
Create a list of the columns which end with ".L":
col_list = [col for col in df.columns if col.endswith(".L")]
@Kosmos's suggestion works well, so I tweaked it to:
for col in df.columns:
    if col.endswith(".L"):
        # do something
In the # do something spot I'll need to convert the values stored in the ".L" columns (convert the numbers to USD), which I already know how to do. The issue is how I can access and change them on the frame without extracting and inserting them again.
Create a list of the columns which end with ".L":
col_list = [col for col in df.columns if col.endswith(".L")]
The following operation gives a frame with only the columns which end with ".L":
df.loc[:, col_list]
After update
1st Solution
I see your problem. The list comprehension I suggested is not immediately suitable (though it could be fixed using a custom function). I think that you are very close to done with your new suggestion. Editing the df column-wise can be done as such:
for col in df.columns:
    if col.endswith(".L"):
        df.loc[:, col] = df.loc[:, col] * arbitrary_value
2nd Solution
Note that if all columns in col_list are converted using the same value (e.g. converting to USD), the following can also be done:
df.loc[:,col_list] = df.loc[:,col_list]*arbitrary_value
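For example, a minimal end-to-end sketch (the exchange rate is a made-up value, not from the original post):

import pandas as pd

df = pd.DataFrame({'AAPL': [144.84], 'WISE.L': [900.0]})
gbp_to_usd = 1.37  # hypothetical exchange rate

# collect every column whose name ends with ".L" and scale it in place
col_list = [col for col in df.columns if col.endswith('.L')]
df.loc[:, col_list] = df.loc[:, col_list] * gbp_to_usd

print(df)
#      AAPL  WISE.L
# 0  144.84  1233.0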
I'm struggling with something I thought would be fairly trivial. I have a spreadsheet that provides data in the format below; unfortunately this can't be changed, as this is the only way it can be provided:
I load the file with pandas in a Jupyter notebook and I can read it, specifying that the header has 3 rows, so far so good. The point is that because some of the headers in the second level repeat themselves (teachers, students, other), I want to combine the 3 levels into one, so I can easily identify what each column does. The data in the top left corner changes every day, hence I renamed that one column to nothing (''). The output I'm looking for should have the following columns: country, region, teachers_present, ..., perf_teachers_score, ..., count_teachers etc.
For some reason, pandas renders this table like this:
It doesn't add any 'Unnamed' column-name placeholders on level 0, but it does on levels 1 and 2. If I concatenate the names as they are, I get some very ugly column names. I need to concatenate them but ignore the Unnamed ones in the process. My code is:
df = pd.read_excel(src, header=[0,1,2])
# to get rid of the date, works as intended
df.columns.set_levels(['', 'perf', 'count'], level=0, inplace=True)
# doesn't work, tells me str has no str method, despite successfully using this function elsewhere
df.columns.set_levels(['' if x.str.contains('unnamed', case=False, na=False) else x for x in df.columns.levels[1].values], level=1, inplace=True)
In conclusion, what am I doing wrong and how do I get my column names concatenated without the Unnamed (and unwanted) labels?
Thank you!
Got it...
df.columns = [
    f'{x}{z}' if 'unnamed' in y.lower()
    else f'{x}{y}' if 'unnamed' in z.lower()
    else f'{x}{y}{z}'
    for x, y, z in df.columns
]
Thank you David, you've been helpful!
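For reference, the same idea can be written as a small helper that joins the levels and skips Unnamed placeholders at any level; a sketch with made-up sample columns modeled on the question (the '_' separator is an assumption based on the desired names like perf_teachers_score):

import pandas as pd

def flatten(parts):
    # join MultiIndex levels, skipping empty and 'Unnamed: ...' placeholders
    keep = [str(p) for p in parts if str(p) and 'unnamed' not in str(p).lower()]
    return '_'.join(keep)

cols = pd.MultiIndex.from_tuples([
    ('', 'country', 'Unnamed: 0_level_2'),
    ('perf', 'teachers', 'score'),
    ('count', 'teachers', 'Unnamed: 5_level_2'),
])
df = pd.DataFrame([['X', 1.0, 2]], columns=cols)

df.columns = [flatten(parts) for parts in df.columns]
print(list(df.columns))
# ['country', 'perf_teachers_score', 'count_teachers']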
I am trying to filter a pandas DataFrame for a selected period (picture added). The start and the end date are to be entered in an input box.
"Day" is found in the index column; however, the third line gives an error message.
df = pd.read_excel('prices.xlsx', index_col=0)
df.iloc[::-1]
filtered_date = df[(df['Day'] >= 'start_date') & (df['Day'] <= 'end_date')]
I get a "KeyError" message. I googled that key_error happens if a key is not available in a dictionary. I do not use a dictionary in my code, and I do not understand how to fix it. The key "Day" is indeed the first value of the first(index) row.
Thank you.
This is my DataFrame: [screenshot]
Look at the picture of your DataFrame: all names of "regular" columns are located a bit higher, while "Day" is located a bit lower, which indicates that it is the name of the index column.
Your code contains df['Day'], so you attempt to reference a regular column named Day. As a regular column of this name does not exist, an exception is thrown.
There are 2 ways to cope with this:
1. Drop index_col=0 from the call to read_excel. This way Day will be a regular column, so your following code should work.
2. Change df['Day'] to df.index. This way you refer to the index.
Of course, put any valid date string in place of 'start_date' and 'end_date'.
And one more thing to consider: as Day is in fact a column holding dates, it should have datetime type. So you should probably add the parse_dates=[0] parameter to read_excel, to have this column converted from string to datetime.
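Putting it together, a sketch of the first option (the filename and dates are placeholders):

import pandas as pd

# no index_col, so Day stays a regular column;
# parse_dates converts it to datetime while reading
df = pd.read_excel('prices.xlsx', parse_dates=['Day'])

start_date = '2021-01-01'  # placeholder dates
end_date = '2021-03-31'

mask = (df['Day'] >= start_date) & (df['Day'] <= end_date)
filtered_date = df[mask]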
I have a large number of time series, with blanks on certain dates for some of them. I read them with xlwings from an Excel sheet:
Y0 = xw.Range('SomeRangeinXLsheet').options(pd.DataFrame, index=True, header=3).value
I'm trying to create a filter to run regressions on those series, so I have to take out the void dates. If I do:
print(Y0.iloc[:,[i]]==Y0.iloc[:,[i]])
I get a proper series of True/False for my column number i, fine.
I'm then stuck: I can't find a way to filter the whole df with the True/False values for that column, or even just to extract that clean series as a pd.Series.
I need them one by one, to adapt the dates of my independent variables to those of each of these series separately.
Thank you for your help.
I believe you want to use df.dropna()
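For example, column by column (a sketch; X stands for your independent-variables frame, an assumed name):

for i in range(Y0.shape[1]):
    # dropna removes the blank dates but keeps the date index
    y = Y0.iloc[:, i].dropna()
    # restrict the regressors to the dates that survived
    X_i = X.loc[y.index]
    # ... run the regression on (X_i, y)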
I am not sure if I understood your problem, but if you want to check for NULLs in a specific column and drop those rows, you can try this -
import pandas as pd
df = df[pd.notnull(df['column_name'])]
For deleting NaNs, df.dropna() should work, as suggested in the previous answer. If it is not working, you can try replacing the NaNs with a placeholder text and then deleting the rows that contain that placeholder:
import numpy as np

df['column_name'] = df['column_name'].replace(np.nan, 'delete-it', regex=True)
df = df[df['column_name'] != 'delete-it']
Hope this helps!