For a dataframe which looks like this:
I simply want to set the index to be the Date column, which you see as the first column.
The dataframe comes from an API, and I save the data to a CSV:
data.to_csv('stocks.csv', header=True, sep=',', mode='a')
data = pd.read_csv('stocks.csv', header=[0,1,2])
data
Preferably, I would also like to get rid of the "Unnamed: ..." labels you see in the picture.
Thanks.
I solved it by specifying header=[0,1], index_col=0 in the read_csv call, and afterwards converting the dataframe to numeric, since the datatypes got distorted (though I believe this is not always necessary):
data = pd.read_csv('stocks.csv', header=[0,1], index_col=0)
data = data.apply(pd.to_numeric, errors='coerce')
# optionally, drop rows that failed conversion:
data = data.dropna()
This way I get exactly what I want: I can write, e.g.,
data['AGN.AS']['High']
and get the high values for a specific stock.
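Side note: the chained lookup above works, but with MultiIndex columns you can do the same selection in a single step. A minimal sketch, assuming the same 'stocks.csv' layout as above:

import pandas as pd

data = pd.read_csv('stocks.csv', header=[0,1], index_col=0)
data = data.apply(pd.to_numeric, errors='coerce')
# single-step equivalent of data['AGN.AS']['High'] on MultiIndex columns
high = data[('AGN.AS', 'High')]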
I'm using pandas to load a short_desc.csv with the following columns: ["report_id", "when", "what"]
with
# read csv
shortDesc = pd.read_csv('short_desc.csv')
# get all numerical and non-null values
shortDesc = shortDesc[shortDesc['report_id'].str.isdigit().notnull()]
# convert 'when' from UNIX timestamp to datetime
shortDesc['when'] = pd.to_datetime(shortDesc['when'], unit='s')
which results in the following:
I'm trying to remove rows that have duplicate 'report_id's by sorting by date and keeping only the row with the newest date for each 'report_id', with the following:
shortDesc = shortDesc.sort_values(by='when').drop_duplicates(['report_id'], keep='last')
The problem is that when I use .sort_values() on this particular dataframe, the values of 'what' come out scattered across all columns, and the 'report_id' values disappear:
shortDesc = shortDesc.sort_values(by=['when'], inplace=False)
I'm not sure why this is happening in this particular instance, since I was able to achieve the correct results with another dataframe of the same shape, using the same code (P.S. it's not a mistake; I dropped the 'what' column in the second picture):
similar shape dataframe
desired results example with similar shape DF
I found out that:
# get all numerical and non-null values
shortDesc = shortDesc[shortDesc['report_id'].str.isdigit().notnull()]
was only checking whether a value was non-null, effectively overriding the str.isdigit() check: .notnull() on the output of .str.isdigit() is True for every string, digit or not, so non-numeric "report_id" values were never dropped. I changed this to two separate lines:
shortDesc = shortDesc[shortDesc['report_id'].notnull()]
shortDesc = shortDesc[shortDesc['report_id'].str.isnumeric()]
which allowed
shortDesc.sort_values(by='when', inplace=True)
to work as intended. I am still confused as to why .sort_values(by='when') was affected by the column 'report_id', so if anyone knows, please enlighten me.
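For illustration, here is a small sketch with made-up values showing why the chained check lets every string through, digit or not:

import pandas as pd

s = pd.Series(['123', 'abc', None])  # hypothetical report_id values
print(s.str.isdigit())            # True, False, NaN
print(s.str.isdigit().notnull())  # True, True, False -- 'abc' slips through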
Trying to read an Excel table that looks like this:
        B       C
A       data    data
data    data    data
but read_excel doesn't recognize that one column's header doesn't start in the first row, and it reads it like this:
Unnamed: 0      B       C
A               data    data
data            data    data
Is there a way to read the data the way I need it? I have checked parameters like header=, but that's not what I need.
A similar question was asked/solved here. So basically the easiest thing would be to either drop the first column (if that's always the problematic column) with
df = pd.read_csv('data.csv', index_col=0)
or remove the unnamed column via
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
You can skip automatic column labeling with something like pd.read_excel(..., header=None).
This avoids the automatic "Unnamed: ..." labels.
Then you can use a more elaborate computation (e.g. the first non-empty value in each column) to recover the labels, such as
df.apply(lambda s: s.dropna().reset_index(drop=True)[0])
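Putting that together, a minimal sketch, assuming a file named 'table.xlsx' shaped like the example above (the file name and the strip_label helper are illustrative, not from the question):

import pandas as pd

raw = pd.read_excel('table.xlsx', header=None)
# the first non-empty value in each column becomes its label
labels = raw.apply(lambda s: s.dropna().reset_index(drop=True)[0])

def strip_label(s):
    # drop the cell that served as the label, keep the rest as data
    return s.dropna().iloc[1:].reset_index(drop=True)

df = raw.apply(strip_label)
df.columns = labels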
Recently, while preprocessing a raw CSV file to replace the missing values, I used the mean of the values in that column. The code snippet is below (assume df is the DataFrame and time is a column whose values are floats representing hours):
df.time = df.time.fillna(df.time.mean())
After that, the null values were successfully replaced with the mean. Now I want to print only the rows which were affected by the command, instead of displaying the entire DataFrame. How can I do that?
You can't identify these rows after you have performed the operation, as that information is lost.
But you can record a boolean mask of the rows that have null values first, and then print those rows afterwards:
na_rows = df.time.isnull()                # boolean mask of rows with missing 'time'
df.time = df.time.fillna(df.time.mean())
print(df[na_rows])                        # show only the rows that were filled
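A quick demonstration with hypothetical numbers:

import pandas as pd
import numpy as np

df = pd.DataFrame({'time': [1.5, np.nan, 3.0, np.nan]})
na_rows = df.time.isnull()
df.time = df.time.fillna(df.time.mean())
print(df[na_rows])
#    time
# 1  2.25
# 3  2.25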
I've completed an analysis and converted the results into a data frame using result.to_frame(). The result looks like this:
The next step is to build a pivot table based on the first column and sum the counts. In order to do that, I want to label the first column "query". I used this code, but the first column cannot be renamed:
df.rename(columns = {'query':'sum of counts'}, inplace = True)
I found that this data frame only has one column, even though I saw two.
It looks like you indeed have 1 column, and the description is actually your index. If your DataFrame is "df", try the following:
df.shape # this should show (N, 1) where N is the number of records
df['Query'] = df.index # to pull a copy of the index into a column
You should be able to proceed normally then.
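An equivalent one-liner sketch, if you prefer naming the index and promoting it to a regular column in one go (the label 'Query' is just the one chosen above):

df = df.rename_axis('Query').reset_index()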
I have a data frame df with a column called "Num_of_employees", which has values like 50-100, 200-500, etc. I see a problem with a few values in my data. Wherever the employee number should be 1-10, the data has it as 10-Jan. Also, wherever the value should be 11-50, the data has it as Nov-50 (presumably the ranges were interpreted as dates in Excel at some point). How would I rectify this problem using pandas?
A clean syntax for this kind of find-and-replace uses a dict:
df.Num_of_employees = df.Num_of_employees.replace({"10-Jan": "1-10",
                                                   "Nov-50": "11-50"})