Cleaning dataframe - assign value in one cell to column - python

I am reading multiple CSV files from a folder into a dataframe. I loop over all the files in the folder and then concat the dataframes to obtain the final dataframe.
However, each CSV file has one summary row from which I want to extract the date and then add it as a new column for all the rows in that csv/dataframe.
df=pd.read_csv(f,header=None,names=['Inverter',"Day Yield",'month Yield','Year Yield','SpecificYieldDay','SYMth','SYYear','Power'],sep=';', **kwargs)
df['date']=df.loc[[0],['Day Yield']]
df
I expect the 'date' column to be filled with the date for that file for all the rows in that particular csv, but it gets filled correctly only for the first row.
Refer to the image of the dataframe: I want all the rows of the 'date' column to show 7/25/2019 instead of only the first row.
I have also added an example of one of the csv files I am reading from.

If I understood correctly, the value that you want to add as a new column for all rows is in df.loc[[0],['Day Yield']].
If that is correct, note that df.loc[[0],['Day Yield']] returns a one-cell DataFrame, which only aligns with the first row on assignment; pull the scalar out instead and let it broadcast to every row:
df = df.assign(date=df.loc[0, 'Day Yield'])
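Putting the whole loop together as a minimal sketch, assuming the summary row is row 0 of every file, its date sits in the 'Day Yield' column, and the folder path and glob pattern are placeholders:
import glob
import pandas as pd

cols = ['Inverter', 'Day Yield', 'month Yield', 'Year Yield',
        'SpecificYieldDay', 'SYMth', 'SYYear', 'Power']

frames = []
for f in glob.glob('folder/*.csv'):        # placeholder folder
    df = pd.read_csv(f, header=None, names=cols, sep=';')
    file_date = df.loc[0, 'Day Yield']     # date from the summary row
    df = df.drop(index=0)                  # optionally drop the summary row itself
    df['date'] = file_date                 # a scalar broadcasts to every row
    frames.append(df)

final = pd.concat(frames, ignore_index=True)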

Related

How to add columns and delete columns in a Json file then save it into csv

I have tried to use a dataframe to add columns and values to the JSON file, but it seems like after I tried to delete some columns it returned to the original data. I was also not able to save it to a CSV file, so I was wondering whether I cannot use a dataframe for this?
The JSON is like a list divided into different columns (around 30 rows in total). Some of them I would like to delete, like the route and URL columns, while adding three columns: length, maxcal, mincal (all the values in these three columns are found in the route column).
This is what I had done so far before getting stuck:
import pandas as pd
import json
data = pd.read_json('fitness.json') # fitness.json is the filename of the json file
fitness2 = pd.DataFrame(fitness2)  # fitness2 is meant to hold the three new columns (length, maxcal, mincal)
fitness2
data.join(fitness2, lsuffix="_left")  # to join the three columns into the data table
I am not sure how can I delete the columns of route, 'MapURL', 'MapURL_tc', 'MapURL_sc' then finally save as a csv like the output shown.
Thank you.
You can drop the columns and then concat the two dataframes:
data.drop(['MapURL', 'MapURL_tc', 'MapURL_sc'], inplace=True, axis=1)
data = pd.concat([data, fitness2], axis=1)  # to join the three columns into the data table
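To get all the way to a CSV file (the part the question also asks about), the concat result has to be kept and then written out. A minimal end-to-end sketch, in which the three new columns are only stand-ins since the real derivation from the 'route' column isn't shown, and the output filename is a placeholder:
import pandas as pd

data = pd.read_json('fitness.json')

# Hypothetical stand-in for the three new columns; in the real case they are
# derived from the 'route' column before it is dropped.
fitness2 = pd.DataFrame({'length': 0, 'maxcal': 0, 'mincal': 0}, index=data.index)

data = data.drop(columns=['route', 'MapURL', 'MapURL_tc', 'MapURL_sc'])
data = pd.concat([data, fitness2], axis=1)
data.to_csv('fitness_output.csv', index=False)  # placeholder output filename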

How to read Excel file starting from give row and column in Pandas

How can I read an Excel file in pandas starting from a given row and column? My Excel file contains some random data in the first rows and columns, so I would either like to begin reading at a given row and column, or drop the first few rows and columns. How can I achieve this?
Typically I would like my data to start from cell B21, i.e. drop everything up to row 20 and drop column A.
Please help.
You can read your file as normal with the pd.read_excel command; to skip the first 20 rows use the skiprows option, and then drop the columns that you do not want. In this case that is whatever column A is called after reading, written here as the placeholder 'columnAname':
df = pd.read_excel('filename', skiprows=20).drop(columns=['columnAname'])
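Alternatively, read_excel can do both in one call via skiprows plus usecols with Excel-style column letters, so the separate drop is not needed. A sketch, assuming the data really starts at cell B21 and 'filename.xlsx' is a placeholder:
import pandas as pd

# Skip the first 20 rows and read only columns B onward (B:Z here as an example range),
# so the dataframe starts at cell B21.
df = pd.read_excel('filename.xlsx', skiprows=20, usecols='B:Z')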

Using iloc keeps picking the last row of multiple data files

I have a list of 50 .csv files named data1.csv, data2.csv etc, I would like plot the first row, third column of each of these files. But first I would like to check the 50 values to ensure I'm plotting the correct thing, I have:
import glob
import pandas as pd
files = glob.glob('data*.csv')
for f in sorted(files):
    df = pd.read_csv(f)
    print(df.iloc[0, 2])
The problem here is in the last line, df.iloc[0,2] prints the 3rd column of the LAST row when I want it to print the 3rd column of the FIRST row.
Essentially print(df.iloc[0,2]) prints the same values as print(df.iloc[-1,2]) and I have no idea why.
How can I check what values the first row, third column are in all of my files?
My mistake: pd.read_csv treats the first row as the header, but my .csv files have no headers, so we need:
df = pd.read_csv(f, header=None)
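To collect the 50 checked values (first row, third column of every file) rather than only printing them, something like this sketch works, assuming the sorted glob order is the order you want to plot in:
import glob
import pandas as pd

files = sorted(glob.glob('data*.csv'))
values = []
for f in files:
    df = pd.read_csv(f, header=None)
    values.append(df.iloc[0, 2])   # first row, third column of each file

print(values)   # one value per file, ready to plot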

Searching CSV files with Pandas (unique id's) - Python

I am looking into searching a CSV file with 242000 rows and want to count the unique identifiers in one of the columns. The column is named 'logid' and has a number of different values, e.g. 1002, 3004, 5003. I want to search the CSV file using a pandas dataframe and count how often each unique identifier occurs. If possible I would then like to create a new CSV file that stores this information. For example, if I find there are 50 logids of 1004, I would like to create a CSV file that has the column name 1004 with the count of 50 displayed below it. I would do this for all unique identifiers and add them to the same CSV file. I am completely new at this and have done some searching but have no idea where to start.
Thanks!
As you haven't posted your code, I can only describe the general way it would work:
1. Load the CSV file into a pd.DataFrame using pandas.read_csv.
2. Keep the first occurrence of each value in a separate df1 using pandas.DataFrame.drop_duplicates, like:
df1 = df.drop_duplicates(keep="first")
This will return a DataFrame that only contains the rows with the first occurrence of each value. E.g. if the value 1000 appears in 5 rows, only the first of them is returned while the others are dropped.
Applying df1.shape[0] will then give you the number of distinct values in your df.
3. If you want to store all rows of df that contain a given "duplicate value" in a separate CSV file, you have to do something like this:
df = pd.DataFrame({"A": [0, 1, 2, 3, 0, 1, 2, 5, 5]})  # this should represent your original data set
print(df)
df1 = df.drop_duplicates(subset="A", keep="first")  # I assume column "A" holds the duplicate values; omit the subset keyword to check whole rows
print(df1)
frames = []
for m in df1["A"]:
    mask = (df == m)
    frames.append(df[mask].dropna())
for dfx in range(len(frames)):
    name = "file{0}".format(dfx)
    frames[dfx].to_csv(r"YOUR PATH\{0}".format(name))
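As an alternative to the drop_duplicates route, the count-per-identifier output described in the question maps almost directly onto value_counts. A minimal sketch, assuming the filename 'data.csv' and the column name 'logid' match your data:
import pandas as pd

df = pd.read_csv('data.csv')           # placeholder filename
counts = df['logid'].value_counts()    # e.g. 1004 -> 50, 3004 -> 12, ...

# One column per identifier with its count in the single row below,
# as described in the question.
counts.to_frame().T.to_csv('logid_counts.csv', index=False)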

Pandas read_excel, csv; names column names mapper?

Suppose you have a bunch of Excel files with an ID and a company name. You have N Excel files in a directory and you read them all into a dataframe; however, in each file the company name column is spelled slightly differently, and you end up with a dataframe with N + 1 columns.
Is there a way to create a mapping for column names, for example:
col_mappings = {
    'company_name': ['name1', 'name2', ..., 'nameN'],
}
so that when you run read_excel you can map all the different spellings of the company name to just one column? Also, could you do this with any type of data file, e.g. read_csv etc.?
Are you concatenating the files after you read them one by one? If yes, you can simply change the column name once you read each file. From your question, I assume your dataframe only contains two columns, Id and CompanyName, so you can rename the second one by position:
df = pd.read_csv(one_file)
df = df.rename(columns={df.columns[1]: 'company_name'})
then concatenate it to the original dataframe.
Otherwise, simply read with the given column names:
df = pd.read_csv(one_file, names=['Id', 'company_name'])
then remove the first row from df, as it contains the original column names.
This works for both .csv and .xlsx files.
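If you do want the mapping style sketched in the question, one option is to invert col_mappings into a rename dictionary and apply it to every file as it is read. A sketch, where the variant names and the glob pattern are placeholders:
import glob
import pandas as pd

col_mappings = {
    'company_name': ['name1', 'name2', 'nameN'],   # variant spellings seen in the files
}

# Invert the mapping: each variant points at the canonical column name.
rename_map = {variant: canonical
              for canonical, variants in col_mappings.items()
              for variant in variants}

frames = []
for f in glob.glob('*.xlsx'):                       # placeholder pattern
    df = pd.read_excel(f)
    frames.append(df.rename(columns=rename_map))    # works the same with pd.read_csv

combined = pd.concat(frames, ignore_index=True)     # a single 'company_name' column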
