Pandas - Convert dictionary to dataframe - keys as columns - python

I have a folder with .csv files that contain timeseries in the following format:
1 0.950861
2 2.34248
3 2.56038
4 3.46226
...
I access these text files by looping over the folder containing them and storing each one in a dictionary:
data_dict = {textfile: pd.read_csv(textfile, header=3, delim_whitespace=True, index_col=0) for textfile in textfiles}
I want to merge the columns so that the data sit next to each other, with the dictionary keys (the file paths) as the column labels. All files have the same number of rows.
So far I have tried passing the dictionary to pd.DataFrame like this:
df = pd.DataFrame.from_dict(data_dict, orient='index')
Actually the orientation needs to be the default 'columns', but that raises an error:
ValueError: If using all scalar values, you must pass an index
With orient='index' I get the wrong result instead (screenshot of the Excel output omitted).
This is how I pass the frame to Excel:
writer = pd.ExcelWriter("output.xls")
df.to_excel(writer, 'data', index_label='data', merge_cells=False)
writer.save()
I think the error must be in how I pass the dictionary to the DataFrame.
I tried pd.concat/merge/append, but nothing returns the right result.
Thanks in advance!

IIUC you can try a list comprehension with concat:
data_list = [pd.read_csv(textfile, header=3, delim_whitespace=True, index_col=0)
             for textfile in textfiles]
print(pd.concat(data_list, axis=1))
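Since the question also asks for the dictionary keys (the file paths) as column labels, here is a minimal sketch of the same idea, assuming the data_dict from the question: passing a dict to pd.concat uses its keys as the outer level of a column MultiIndex.

import pandas as pd

# Sketch, assuming the data_dict from the question; the dict keys
# (file paths) become the outer level of a column MultiIndex.
df = pd.concat(data_dict, axis=1)
# Each file is assumed to contribute one data column, so drop the
# inner level and keep only the file names as column labels.
df.columns = df.columns.droplevel(1)
# Reuse the question's output file.
df.to_excel('output.xls', sheet_name='data', merge_cells=False)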

Related

Reducing the size of JSON to dataframe - Is there an equivalent 'usecols' argument for pandas.read_json?

I'm trying to reduce the size of a dataframe I have.
The source data is gzipped JSON and approx. 20 GB in s3 and each line looks like this:
{"timestamp":"1633121635","name":"www.hello.com","type":"a","value":"ipv4:1.1.1.1"}
I'm using pandas.read_json to read it in chunks and then drop the keys I don't want, e.g.
for df in pd.read_json(s3_source_data_location,
                       lines=True,
                       chunksize=20000000):
    df.drop('timestamp', axis=1, inplace=True)
    df.drop('type', axis=1, inplace=True)
I know I can try reducing the size by fiddling with the datatypes for 'value' and 'name', but I first want to see if I can read only the keys I want.
pandas.read_csv has a 'usecols' argument where you can specify the columns you want to read. I am hoping there is a way to do this with JSON.
Looking at the structure of the JSON, you only want to keep name and value. To do that and reduce the computing time, first create a list of the columns you want to keep in the dataframe:
cols = ['name', 'value']
Then import the required modules, create an empty list, and define your file path:
import json
import pandas as pd

data = []
file_name = 'path-of-json-file.json'
Then open the file and read only the fields you need, line by line:
with open(file_name, encoding='latin-1') as f:
    for line in f:
        doc = json.loads(line)
        lst = [doc['name'], doc['value']]
        data.append(lst)

df = pd.DataFrame(data=data, columns=cols)
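Alternatively, if you want to stay with the chunked pandas approach from the question, here is a sketch that keeps only the wanted columns as each chunk arrives (pd.read_json has no usecols equivalent, so the parsing cost itself is unavoidable):

import pandas as pd

# Sketch, reusing s3_source_data_location from the question; subset each
# chunk immediately so only 'name' and 'value' are kept in memory.
chunks = []
for chunk in pd.read_json(s3_source_data_location, lines=True, chunksize=2000000):
    chunks.append(chunk[['name', 'value']])
df = pd.concat(chunks, ignore_index=True)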

Reset labels in Pandas DataFrame, Python

I have a csv file whose first row contains bad data; the real column labels are in row 2. So when I load this file into a DataFrame, the column labels are wrong, and the correct names become the values of row 0. Is there any function similar to reset_index() but for columns? PS: I cannot change the csv file. (Screenshot of the DataFrame with wrong labels omitted.)
Hello, let's suppose your csv file is data.csv. Try this code:
import pandas as pd

# read the csv file
df = pd.read_csv('data.csv')
# replace the header names with integers
df.columns = range(df.shape[1])
# save the data to another csv file, without a header row
df.to_csv('data_without_header.csv', header=False, index=False)
# read the new csv file
new_df = pd.read_csv('data_without_header.csv')
# inspect the new data
new_df.head()
If you do not care about the rows preceding your column names, you can pass the "header" argument with the number of the correct row. For example, if the proper column names are in row 2:
df = pd.read_csv('my_csv.csv', header=2)
Keep in mind that this will erase the previous rows from the DataFrame. If you still want to keep them, you can do the following instead:
df = pd.read_csv('my_csv.csv')
df.columns = df.iloc[2, :] # replace columns with values in row 2
Cheers.
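A follow-up sketch (not from the original answer): once the names from row 2 are promoted to column labels, you may later decide to drop that row and the ones above it after all.

import pandas as pd

# Sketch: promote row 2 to column labels, then discard rows 0-2.
df = pd.read_csv('my_csv.csv')
df.columns = df.iloc[2, :]
df = df.drop(index=[0, 1, 2]).reset_index(drop=True)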

Append columns from excel file to csv file based on if statement

I have two files:
One with 'filename' and 'value_count' columns (ValueCounts.csv)
Another with 'filename' and 'latitude' and 'longitude' columns (GeoData.xlsx)
I have started by creating dataframes for each file and the specific columns within that I intend on using. My code for this is as follows:
Xeno_values = pd.read_csv(r'C:\file_path\ValueCounts.csv')
img_coords = pd.read_excel(r'C:\file_path\GeoData.xlsx')
df_values = pd.DataFrame(Xeno_values, columns = ['A','B'])
df_coords = pd.DataFrame(img_coords, columns = ['L','M','W'])
However, when I print() each dataframe, all the column values are returned as 'NaN'.
How do I correct this? And then, how do I write an if statement that iterates over the data and says:
if 'filename' (col 'A') in df_values equals 'filename' (col 'W') in df_coords, append 'latitude' (col 'L') and 'longitude' (col 'M') to df_values
If any clarification is needed please do ask.
Thanks,
R
Check out the documentation for pandas read_csv and read_excel (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). These functions already return the data in a dataframe. Your code is trying to create a dataframe from a dataframe, which is fine if you don't specify columns, but returns all NaN values if you name columns that don't exist in the source.
So if you want to load the dataframes:
df_values = pd.read_csv(r'C:\file_path\ValueCounts.csv')
df_coords = pd.read_excel(r'C:\file_path\GeoData.xlsx')
Will do the trick. And if you just want specific columns:
df_values = pd.read_csv(r'C:\file_path\ValueCounts.csv', usecols=['A','B'])
df_coords = pd.read_excel(r'C:\file_path\GeoData.xlsx', usecols=['L','M','W'])
Make sure that those column names actually exist in your csv files.
If you want to rename columns (make sure you're doing all columns here):
df_values.columns = ['Filename', 'Date']
For adding lat/long to df_values you could try:
df = pd.merge(df_values, df_coords[['filename', 'LAT', 'LONG']], on='filename', how='inner')
This assumes that there is a column 'filename' in both the values and coords dataframes, and that the coords dataframe has columns 'LAT' and 'LONG' in it.
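A toy demonstration of that inner merge, with hypothetical data:

import pandas as pd

# Hypothetical frames just to show the join behaviour.
df_values = pd.DataFrame({'filename': ['a.jpg', 'b.jpg'], 'value_count': [3, 7]})
df_coords = pd.DataFrame({'filename': ['a.jpg', 'c.jpg'],
                          'LAT': [51.5, 40.7],
                          'LONG': [-0.1, -74.0]})
df = pd.merge(df_values, df_coords[['filename', 'LAT', 'LONG']], on='filename', how='inner')
print(df)
# Only 'a.jpg' survives: an inner join keeps rows whose 'filename'
# appears in both frames.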
Lastly, work through a tutorial on pandas (https://www.tutorialspoint.com/python_pandas/index.htm). Becoming more familiar with it will help you wrangle data better.

How to change the order of columns while converting to excel in Pandas?

I have a dictionary, 'values', in Python. 'values' contains lists of integers, except for the 'RowHeaders' entry.
I would like to have 'RowHeaders' as the first column in the excel file. In the following code, I cannot add a condition to the 'from_items' method to put it first, and when I run the code, the 'RowHeaders' data does not end up in the first column.
values['RowHeaders'] = list_of_headers
for feat in features:
    values.setdefault(feat, list())
    for p in data:
        values[feat].append(int(data[p][feat]))

writer = pd.ExcelWriter('output.xlsx')
df = pd.DataFrame.from_items([(f, values[f]) for f in values])
df.to_excel(writer, 'Sheet1', index=False)
writer.save()
Thanks.
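One possible approach (a sketch, not from the original thread; note that DataFrame.from_items was removed in later pandas versions, so this builds the frame directly and then reorders the columns so 'RowHeaders' comes first):

import pandas as pd

# Sketch, assuming the 'values' dict from the question holds
# equal-length lists.
df = pd.DataFrame(values)
# Put 'RowHeaders' first, keeping the remaining columns in order.
ordered = ['RowHeaders'] + [c for c in df.columns if c != 'RowHeaders']
df = df[ordered]
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)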

Pandas Data Frame saving into csv file

I wonder how to save a new pandas Series into a different column of a csv file. Suppose I have two csv files that both contain a column 'A'. I apply some mathematical function to them and then create a new variable 'B'.
For example:
data = pd.read_csv('filepath')
data['B'] = data['A']*10
# and append data.B to a list: B_list.append(data.B)
This continues until all of the rows of the first and second csv files have been read.
I would like to save column B from both csv files in a new spreadsheet.
For example, I need this result:
column1 (from csv1)    column2 (from csv2)
data.B values          data.B values
By using this code:
pd.DataFrame(np.array(B_list)).T.to_csv('file.csv', index=False, header=None)
I don't get my preferred result.
Each column in a pandas DataFrame is a pandas Series, so your B_list is actually a list of pandas Series. You can pass it to the DataFrame() constructor and then transpose (or, as @jezrael shows, do a horizontal merge with pd.concat(..., axis=1)):
finaldf = pd.DataFrame(B_list).T
finaldf.to_csv('output.csv', index=False, header=None)
Should the csv files have different numbers of rows, the unequal Series are filled with NaNs at the corresponding rows.
I think you need to concat the column from data1 with the column from data2 first:
df = pd.concat(B_list, axis=1)
df.to_csv('file.csv', index=False, header=None)
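Putting the whole flow together, a self-contained sketch with hypothetical file names file1.csv and file2.csv, each assumed to contain a column 'A':

import pandas as pd

# Hypothetical input files; each is assumed to have a column 'A'.
B_list = []
for path in ['file1.csv', 'file2.csv']:
    data = pd.read_csv(path)
    B_list.append(data['A'] * 10)

# Horizontal merge: each Series becomes one column of the output.
df = pd.concat(B_list, axis=1)
df.to_csv('file.csv', index=False, header=False)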
