Append columns from Excel file to CSV file based on if statement - Python

I have two files:
One with 'filename' and 'value_count' columns (ValueCounts.csv)
Another with 'filename', 'latitude' and 'longitude' columns (GeoData.xlsx)
I have started by creating dataframes for each file and the specific columns within them that I intend to use. My code for this is as follows:
Xeno_values = pd.read_csv(r'C:\file_path\ValueCounts.csv')
img_coords = pd.read_excel(r'C:\file_path\GeoData.xlsx')
df_values = pd.DataFrame(Xeno_values, columns = ['A','B'])
df_coords = pd.DataFrame(img_coords, columns = ['L','M','W'])
However, when I print() each dataframe, all the column values are returned as NaN.
How do I correct this? And then write an if statement that iterates over the data and says:
if 'filename' (col 'A') in df_values == 'filename' (col 'W') in df_coords, append 'latitude' (col 'L') and 'longitude' (col 'M') to df_values
If any clarification is needed please do ask.
Thanks,
R

Check out the documentation for pandas read_csv and read_excel (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). These functions already return the data in a dataframe. Your code is passing a dataframe to pd.DataFrame() with a columns list; that is fine if you don't specify columns, but when the names in columns don't exist in the data, pandas performs column selection and you get columns full of NaN.
So if you want to load the dataframes:
df_values = pd.read_csv(r'C:\file_path\ValueCounts.csv')
df_coords = pd.read_excel(r'C:\file_path\GeoData.xlsx')
That will do the trick. And if you just want specific columns:
df_values = pd.read_csv(r'C:\file_path\ValueCounts.csv', usecols=['A','B'])
df_coords = pd.read_excel(r'C:\file_path\GeoData.xlsx', usecols=['L','M','W'])
Make sure that those column names actually exist in your files.
If you want to rename columns (make sure you're listing every column here):
df_values.columns = ['Filename', 'Date']
For adding lat/long to df_values you could try:
df = pd.merge(df_values, df_coords[['filename', 'LAT', 'LONG']], on='filename', how='inner')
This assumes that a 'filename' column exists in both the values and coords dataframes, and that the coords dataframe has 'LAT' and 'LONG' columns in it.
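Putting it all together, here's a minimal end-to-end sketch. The header names 'filename', 'latitude' and 'longitude' are assumptions based on your description; substitute whatever your files actually use:
import pandas as pd

# read_csv / read_excel already return DataFrames
df_values = pd.read_csv(r'C:\file_path\ValueCounts.csv')    # 'filename', 'value_count'
df_coords = pd.read_excel(r'C:\file_path\GeoData.xlsx')     # 'filename', 'latitude', 'longitude'

# merge replaces the row-by-row if statement: rows are matched on 'filename'
df_values = pd.merge(
    df_values,
    df_coords[['filename', 'latitude', 'longitude']],
    on='filename',
    how='inner',    # keep only filenames present in both files
)
df_values.to_csv(r'C:\file_path\ValueCounts_with_coords.csv', index=False)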
Lastly, check out a tutorial on pandas (https://www.tutorialspoint.com/python_pandas/index.htm). Becoming more familiar with it will help you wrangle data better.

Related

Pandas, I get dataframe full of nan when reading from xlsx

I am reading from an Excel file ".xlsx". It consists of 3 columns, but when I read from it, I get a DF full of NaNs. I checked the table in Excel; it consists of normal cells, no formulas, no hyperlinks.
My code:
data = pd.read_excel("Data.xlsx")
df = pd.DataFrame(data, columns=["subreddit_group", "links/caption", "subreddits/flair"])
print(df)
The columns parameter of the pd.DataFrame() constructor doesn't set column names in the resulting dataframe; it selects columns from the original data.
See the pandas documentation:
Column labels to use for resulting frame when data does not have them, defaulting to RangeIndex(0, 1, 2, …, n). If data contains column labels, will perform column selection instead.
So you shouldn't provide the columns parameter; instead, rename the columns of the dataframe after the file is read:
df = pd.DataFrame(data)
df.columns = ['subreddit_group', 'links/caption', 'subreddits/flair']
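A quick sketch of the difference on dummy data:
import pandas as pd

data = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Selection: 'a' exists and is kept; 'c' does not exist, so it comes back all NaN
print(pd.DataFrame(data, columns=['a', 'c']))

# Renaming: assign to .columns after construction instead
df = pd.DataFrame(data)
df.columns = ['first', 'second']
print(df)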

Reset labels in Pandas DataFrame, Python

I have a CSV file with wrong data in its first row. The actual label names are in row 2, so when I store this file in a DataFrame the label names are incorrect and the correct names become the values of row 0. Is there any function similar to reset_index() but for columns? PS: I cannot change the CSV file.
Hello, let's suppose your CSV file is data.csv.
Try this code:
import pandas as pd
#reading the csv file
df = pd.read_csv('data.csv')
#changing the headers name to integers
df.columns = range(df.shape[1])
#saving the data in another csv file, without writing the old header
df.to_csv('data_without_header.csv', header=False, index=False)
#reading the new csv file
new_df = pd.read_csv('data_without_header.csv')
#inspecting the new data
new_df.head()
If you do not care about the rows preceding your column names, you can pass in the "header" argument with the (0-indexed) number of the row holding the correct names. In your case they are on the second line of the file, i.e. row 1:
df = pd.read_csv('my_csv.csv', header=1)
Keep in mind that this will drop the preceding rows from the DataFrame. If you still want to keep them, you can do the following thing:
df = pd.read_csv('my_csv.csv')
df.columns = df.iloc[0, :] # replace the labels with the values of row 0, where the real names ended up
Cheers.
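A self-contained sketch of both approaches on dummy two-header-line data (io.StringIO stands in for your CSV file):
import io
import pandas as pd

csv_text = 'junk1,junk2\ncol_a,col_b\n1,2\n3,4\n'

# Approach 1: the real header is on the second line of the file, i.e. 0-indexed row 1
df1 = pd.read_csv(io.StringIO(csv_text), header=1)

# Approach 2: read everything, promote row 0 to the header, then drop it
df2 = pd.read_csv(io.StringIO(csv_text))
df2.columns = df2.iloc[0]
df2 = df2.drop(0).reset_index(drop=True)
# note: df2's values stay as strings, since the header row was read in as data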

Pandas dataframe read_excel does not consider blank upper left cells as columns?

I'm trying to read an Excel or CSV file into a pandas dataframe. The code should read only the first two columns, with the top row of those columns supplying the column names. The problem occurs when the first cell of the top row is empty in the Excel file:
           IDs
2/26/2010    2
3/31/2010    4
4/31/2010    2
5/31/2010    2
Then, the last line of the following code fails:
uploaded_file = request.FILES['file-name']
if uploaded_file.name.endswith('.csv'):
    df = pd.read_csv(uploaded_file, usecols=[0,1])
else:
    df = pd.read_excel(uploaded_file, usecols=[0,1])
ref_date = 'ref_date'
regime_tag = 'regime_tag'
df.columns = [ref_date, regime_tag]
Apparently, read_excel only reads one column (i.e. the IDs). read_csv, however, reads both columns, with the first column being unnamed. I want the Excel path to behave that way too and read both columns regardless of whether the top cells are empty or filled. How do I go about doing that?
What's happening is the first "column" in the Excel file is being read in as an index, while in the CSV file it's being treated as a column / series.
I recommend you work the other way and amend pd.read_csv to read the first column as an index. Then use reset_index to elevate the index to a series:
if uploaded_file.name.endswith('.csv'):
    df = pd.read_csv(uploaded_file, usecols=[0,1], index_col=0)
else:
    df = pd.read_excel(uploaded_file, usecols=[0,1])
df = df.reset_index()  # this will elevate index to a column called 'index'
This gives consistent output: the first series will have the label 'index', and the index of the dataframe will be the regular pd.RangeIndex.
You could potentially use a dispatcher to get rid of the unwieldy if / else construct:
file_flag = {True: pd.read_csv, False: pd.read_excel}
read_func = file_flag[uploaded_file.name.endswith('.csv')]
df = read_func(uploaded_file, usecols=[0,1], index_col=0).reset_index()
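For illustration, a sketch of what reset_index does here, using dummy data in place of the uploaded file:
import io
import pandas as pd

csv_text = ',IDs\n2/26/2010,2\n3/31/2010,4\n'

df = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1], index_col=0)
df = df.reset_index()    # the former index becomes the first regular column
print(df.head())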

Concatenate 2 Rows to be header/column names

I have an Excel sheet that is really poorly formatted. The actual column names I would like to use are split across two rows; for example, if the correct column name should be Labor Percent, cell A1 contains Labor and cell A2 contains Percent.
I try to load the file, here's what I'm doing:
import os
import pandas as pd

os.chdir(r'xxx')

file = 'problem.xls'
xl = pd.ExcelFile(file)
print(xl.sheet_names)
df = xl.parse('WEEKLY NUMBERS', skiprows=35)
The remainder of what should be the column name is in the second row. Is there a way to rename the columns by concatenating the two rows? Can this somehow be done with the header= argument in the xl.parse call?
You can rename the columns yourself by setting:
df.columns = ['name1', 'name2', 'name3' ...]
Note that you must specify a name for every column.
Then drop the first row to get rid of the unwanted row of column names.
df = df.drop(0)
Here's something you can try. Essentially it reads in the first two rows as your header, but treats them as a hierarchical multi-index. The second line of code below then flattens that multi-index down to a single string per column. I'm not 100% certain it will work for your data, but it's worth a try - it worked for the small dummy test data I tried it with:
df = pd.read_excel('problem.xls', sheet_name='WEEKLY NUMBERS', header=[0, 1])
df.columns = df.columns.map(' '.join)
The second line was taken from this answer about flattening a multi-index.
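For what it's worth, a small sketch of the flattening step on dummy data (not your actual sheet):
import pandas as pd

# two header rows, as in the spreadsheet: 'Labor' above 'Percent', etc.
columns = pd.MultiIndex.from_tuples([('Labor', 'Percent'), ('Overtime', 'Hours')])
df = pd.DataFrame([[0.25, 3], [0.40, 5]], columns=columns)

df.columns = df.columns.map(' '.join)
print(df.columns.tolist())    # ['Labor Percent', 'Overtime Hours']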

Pandas Data Frame saving into csv file

I wonder how to save a new pandas Series into a CSV file as a separate column. Suppose I have two CSV files that both contain a column 'A'. I apply some mathematical function to them and then create a new variable 'B'.
For example:
data = pd.read_csv('filepath')
data['B'] = data['A']*10
# and add data.B to a list: B_list.append(data.B)
This continues until all of the rows of the first and second CSV files have been read.
I would like to save column B from both CSV files into a new spreadsheet.
For example I need this result:
column1 (from csv1)    column2 (from csv2)
data.B.value           data.B.value
By using this code:
pd.DataFrame(np.array(B_list)).T.to_csv('file.csv', index=False, header=None)
I don't get my preferred result.
Each column in a pandas DataFrame is a pandas Series, so your B_list is actually a list of pandas Series, which you can pass to the DataFrame() constructor and then transpose (or, as @jezrael shows, merge horizontally with pd.concat(..., axis=1)):
finaldf = pd.DataFrame(B_list).T
finaldf.to_csv('output.csv', index=False, header=False)
Should the CSV files have different numbers of rows, the shorter Series are filled with NaNs at the corresponding rows.
I think you need to concatenate the column from data1 with the column from data2 first:
df = pd.concat(B_list, axis=1)
df.to_csv('file.csv', index=False, header=False)
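A minimal sketch of the whole flow; the file names and the 'A' column are assumptions taken from the question:
import pandas as pd

B_list = []
for path in ['csv1.csv', 'csv2.csv']:    # hypothetical file names
    data = pd.read_csv(path)
    B_list.append(data['A'] * 10)        # each element is a Series

df = pd.concat(B_list, axis=1)           # one column per input file
df.to_csv('file.csv', index=False, header=False)
If the two files have different numbers of rows, the shorter column is padded with NaN.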
