My CSV has only two columns like:
I have multiple CSVs that have this format and tried to concatenate all of them using this:
import glob
import pandas as pd

path = r'...'
all_files = glob.glob(path + '/*.csv')
df_from_each_file = [pd.read_csv(f, error_bad_lines=False) for f in all_files]
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
However, this gives me
a a.1 a.2 command description
0 NaN NaN NaN aaa ikegroup WORD Name of the IKE group
1 NaN NaN NaN aaa ikegroup <cr> NaN
2 NaN NaN NaN aaa locald trace Show trace data for the locald component(cisco...
3 NaN NaN NaN aaa login trace Show trace data for login sub system
4 NaN NaN NaN aaa password-policy statistics Show statistics related to password policy
and I don't know where the first three columns came from.
What am I doing wrong?
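Duplicate header names are the usual cause here: if some file's header row has, say, three columns all named a, pandas deduplicates them to a, a.1, a.2, and pd.concat then takes the union of all columns across files, filling the gaps with NaN. A minimal sketch for finding the offending files, assuming your two expected columns are command and description (as in the output above):

import glob
import pandas as pd

path = r'...'  # same placeholder path as in the question
for f in glob.glob(path + '/*.csv'):
    cols = list(pd.read_csv(f, nrows=0).columns)  # nrows=0 reads only the header row
    if cols != ['command', 'description']:
        print(f, cols)  # flags files whose header is not the expected two columns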
I'm testing a light sensor for sensitivity. I now have data that I would like to plot.
The sensor has 24 levels of sensitivity.
I'm only testing levels 0, 6, 12, 18 and 23.
On the x-axis: PWM value, range 0-65000.
My goal is to plot the data from a dataframe using plotly.
My question is:
How can I combine the data (as shown below) into a dataframe for plotting?
EDIT: The link to my csv files: https://filetransfer.io/data-package/QwzFzT8O
Also below: my code so far
Thanks!
import plotly.express as px
import pandas as pd

def main_code():
    data = pd.DataFrame(columns=['PWM', 'sens_00', 'sens_06', 'sens_12', 'sens_18', 'sens_23'])
    sens_00 = pd.read_csv('sens_00.csv', sep=';')
    sens_06 = pd.read_csv('sens_06.csv', sep=';')
    sens_12 = pd.read_csv('sens_12.csv', sep=';')
    sens_18 = pd.read_csv('sens_18.csv', sep=';')
    sens_23 = pd.read_csv('sens_23.csv', sep=';')
    print(data)
    print(sens_23)

if __name__ == '__main__':
    main_code()
@Dawid's answer is fine, but it does not produce a complete dataframe (so you can do more than just plotting), and it contains too much redundancy.
Below is a better way to concatenate the multiple csv files.
Then plotting is just a single call.
Reading csv files into a single dataframe:
from pathlib import Path
import pandas as pd

def read_dataframes(data_root: Path):
    # This could be turned into a single line,
    # but keeping it more readable here
    dataframes = []
    for fpath in data_root.glob("*.csv"):
        df = pd.read_csv(fpath, sep=";")
        df = df[["pwm", "lux"]]
        df = df.rename({"lux": fpath.stem}, axis="columns")
        df = df.set_index("pwm")
        dataframes.append(df)
    return pd.concat(dataframes)

data_root = Path("data")
df = read_dataframes(data_root)
df
df
sens_06 sens_18 sens_12 sens_23 sens_00
pwm
100 0.00000 NaN NaN NaN NaN
200 1.36435 NaN NaN NaN NaN
300 6.06451 NaN NaN NaN NaN
400 12.60010 NaN NaN NaN NaN
500 20.03770 NaN NaN NaN NaN
... ... ... ... ... ...
64700 NaN NaN NaN NaN 5276.74
64800 NaN NaN NaN NaN 5282.29
64900 NaN NaN NaN NaN 5290.45
65000 NaN NaN NaN NaN 5296.63
65000 NaN NaN NaN NaN 5296.57
[2098 rows x 5 columns]
Plotting:
df.plot(backend="plotly") # equivalent to px.line(df)
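If you don't want to pass the backend on every call, pandas lets you set it once per session (this assumes a reasonably recent pandas and plotly installed):

import pandas as pd

pd.options.plotting.backend = "plotly"  # set the plotting backend once
fig = df.plot()  # df is the dataframe returned by read_dataframes above
fig.show()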
Here is my suggestion. You have two columns in each file, and you need unique column names to keep both of them. All files are loaded and appended to an empty DataFrame called data. To draw every dataset in one figure, you add each additional trace with fig.add_scatter. The code:
import pandas as pd
import plotly.express as px

def main_code():
    data = pd.DataFrame()
    for filename in ['sens_00', 'sens_06', 'sens_12', 'sens_18', 'sens_23']:
        data[['{}-PWM'.format(filename), '{}-LUX'.format(filename)]] = pd.read_csv('{}.csv'.format(filename), sep=';')
    print(data)
    fig = px.line(data_frame=data, x=data['sens_00-PWM'], y=data['sens_00-LUX'])
    for filename in ['sens_06', 'sens_12', 'sens_18', 'sens_23']:
        fig.add_scatter(x=data['{}-PWM'.format(filename)], y=data['{}-LUX'.format(filename)], mode='lines')
    fig.show()

if __name__ == '__main__':
    main_code()
Based on the suggestion by @Dawid
This is what I was going for.
I'm working with several csvs in Pandas. I changed some data names in the original csv file and saved the file. Then I restarted and reloaded my jupyter notebook, but now I get something like this for every dataframe I load from the data source:
Department Zone Element Product Year Unit Value
0 U1,"Z3","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
1 U1,"Z3","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
2 U1,"Z5","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
3 U1,"Z6","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
4 U1,"Z9","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
I tried to use sep=',', encoding='UTF-8-SIG', quotechar='"', quoting=0, engine='python', but the same issue remains. I don't know how to parse the csv, because even when I created a new csv from the data (without the quotes, and with ; as the separator) the same issue appears...
The csv has 321 rows; here is an example file with the problem: https://www.cjoint.com/c/LDCmfvq06R6
and the original csv file that Pandas reads without problems: https://www.cjoint.com/c/LDCmlweuR66
I think the problem is with the quotes in the file.
import csv
import pandas as pd

df = pd.read_csv('LDCmfvq06R6_FAOSTAT.csv', quotechar='"',
                 delimiter=',',
                 quoting=csv.QUOTE_NONE,
                 on_bad_lines='skip')

# strip the leftover quote characters from every column
for i, col in enumerate(df.columns):
    df.iloc[:, i] = df.iloc[:, i].str.replace('"', '')

df.head()
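When a whole row lands in the first column like this, the delimiter passed to read_csv usually does not match the one actually used in the file. A minimal sketch for inspecting the raw text before parsing, reusing the filename from the snippet above:

# Peek at the first raw lines to see which delimiter and quoting the file really uses
with open('LDCmfvq06R6_FAOSTAT.csv', encoding='utf-8-sig') as fp:
    for _ in range(3):
        print(repr(fp.readline()))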
Good afternoon
I am trying to import more than 100 separate .txt files containing data I want to plot. I would like to automate this process, since doing the same steps for every individual file is tedious.
I have read up on how to read multiple .txt files and found a nice explanation. However, following the example, all my data gets imported as NaNs. I read up some more and found a more reliable way of importing .txt files, namely pd.read_fwf(), as can be seen here.
Although I can at least see my data now, I have no clue how to plot it, since the data is in one column separated by \t, e.g.
0 Extension (mm)\tLoad (kN)\tMachine extension (mm)\tPreload extension
1 0.000000\t\t\t
2 0.152645\t0.000059312\t.....
... etc.
I have tried using different separators in both pd.read_csv() and pd.read_fwf(), including ' ', '\t' and '\s+', but to no avail.
Of course this causes a problem, because now I cannot plot my data. Speaking of which, I am also not sure how to plot the data in the dataframe. I want to plot each .txt file's data separately on the same scatter plot.
I am very new to Stack Overflow, so pardon the format of the question if it does not conform to the normal standard. I attach my code below, but unfortunately I cannot attach my .txt files. Each .txt file contains about a thousand rows of data. I attach a picture of the general format of all the files: General format of the .txt files.
import numpy as np
import pandas as pd
from matplotlib import pyplot as pp
import os
import glob
# change the working directory
os.chdir(r"C:\Users\Philip de Bruin\Desktop\Universiteit van Pretoria\Nagraads\sterktetoetse_basislyn\trektoetse\speel")
# get the file names
leggername = [i for i in glob.glob("*.txt")]
# read each file into its own dataframe (this gives a list of dataframes)
df = [pd.read_fwf(legger) for legger in leggername]
df
EDIT: the output I get now for the DataFrame is:
[ Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.152645\t0.000059312\t-...
4
... ...
997 76.0173\t0.037706\t0.005...
998
999 76.1699\t0.037709\t\t
1000
1001
from Preload (mm)
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN
[1002 rows x 4 columns],
Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.128151\t0.000043125\t-...
4
... ...
997 63.8191\t0.034977\t-0.00...
998
999 63.9473\t0.034974\t\t
1000
1001
from Preload (mm)
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN
[1002 rows x 4 columns],
Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.174403\t0.000061553\t0...
4
... ...
997 86.8529\t0.036093\t-0.00...
998
999 87.0273\t\t-0.0059160\t-...
1000
1001
from Preload (mm)
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN
... etc
The basic gist is to skip the first data row (that has a single value in it), then read the individual files with pd.read_csv, using tab as the separator, and stack them together.
There is, however, a more problematic issue: the data files turn out to be UTF-16 encoded (the binary data shows a NUL character at the even positions), but there is no byte-order mark (BOM) to indicate this. As a result, you can't simply specify the encoding in read_csv; you have to read each file as binary, decode it from UTF-16 to a string, and then feed that string to read_csv. Since the latter requires a filename or IO stream, the text data needs to be put into a StringIO object first (or you could save the corrected data to disk and then read the corrected file; that might not be a bad idea).
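A minimal way to confirm that diagnosis yourself, assuming one of the .txt files sits in the working directory (the filename here is a placeholder):

# For UTF-16 text, ASCII content shows a NUL byte (b'\x00') in every other position;
# a file with a BOM would instead begin with b'\xff\xfe' or b'\xfe\xff'
with open('example.txt', 'rb') as fp:  # 'example.txt' is a placeholder name
    print(fp.read(16))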
import pandas as pd
import os
import glob
import io

# change the working directory
os.chdir(r"C:\Users\Philip de Bruin\Desktop\Universiteit van Pretoria\Nagraads\sterktetoetse_basislyn\trektoetse\speel")

dfs = []
for filename in glob.glob("*.txt"):
    with open(filename, 'rb') as fp:
        data = fp.read()  # a single file should fit in memory just fine
    # Decode the UTF-16 data that is missing a BOM
    string = data.decode('UTF-16')
    # And put it into a stream, for ease of use with `read_csv`
    stream = io.StringIO(string)
    # Read the data from the, now properly decoded, stream;
    # skip the single-value row, and use tabs as separators
    df = pd.read_csv(stream, sep='\t', skiprows=[1])
    # To keep track of the individual files, add an "origin" column
    # with its value set to the corresponding filename
    df['origin'] = filename
    dfs.append(df)

# Concatenate all dataframes (default is to stack the rows)
df = pd.concat(dfs)
# For a quick and dirty plot, you can enjoy the power of Seaborn
import seaborn as sns
# Use appropriate (full) column names, and use the 'origin'
# column for the hue and symbol
sns.scatterplot(data=df, x='Time (s)', y='Machine Extension (mm)', hue='origin', style='origin')
See Seaborn's scatterplot documentation for more options.
I have a dataframe with just one column with content like:
view: meta_record_extract
dimension: e_filter
type: string
hidden: yes
sql: "SELECT * FROM files"
dimension: category
type: string
...
What I want to produce is a dataframe with columns and data like this:
view                | dimension | label | type   | hidden | sql
meta_record_extract | e_filter  | NaN   | string | yes    | "SELECT * FROM files"
NaN                 | category  | NaN   | string | ...
Given that splitting the string data like
df.header[0].split(': ')[0]
gives me the label with [0] or the value with [1],
I tried this:
df.pivot_table(df, columns = df.header.str.split(': ')[0], values = df.header.str.split(': ')[1])
but it did not work and raised an error.
Can anyone help me to achieve the result I need?
Use str.findall() + map, as follows:
str.findall() helps you extract the keyword and value pairs into a list. We then map each list of keyword-value pairs into a dict for pd.DataFrame to turn the dicts into a dataframe.
(Assuming the column label of your column is Col1):
df_extract = df['Col1'].str.findall(r'(\w+):\s*(.*)')
df_result = pd.DataFrame(map(dict, df_extract))
Result:
print(df_result)
view dimension type hidden sql
0 meta_record_extract NaN NaN NaN NaN
1 NaN e_filter NaN NaN NaN
2 NaN NaN string NaN NaN
3 NaN NaN NaN yes NaN
4 NaN NaN NaN NaN "SELECT * FROM files"
5 NaN category NaN NaN NaN
6 NaN NaN string NaN NaN
Update
To compress the rows and minimize the NaNs, we can further use .apply() with .dropna(), as follows:
df_compressed = df_result.apply(lambda x: pd.Series(x.dropna().values))
Result:
print(df_compressed)
view dimension type hidden sql
0 meta_record_extract e_filter string yes "SELECT * FROM files"
1 NaN category string NaN NaN
I am trying to read a delimited text file into a dataframe in python. The delimiter is not being identified when I use pd.read_table. If I explicitly set sep = ' ', I get an error: Error tokenizing data. C error. Notably, the defaults work when I use np.loadtxt().
Example:
pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
comment = '%',
header = None)
0
0 1850 1 -0.777 0.412 NaN NaN...
1 1850 2 -0.239 0.458 NaN NaN...
2 1850 3 -0.426 0.447 NaN NaN...
3 1850 4 -0.680 0.367 NaN NaN...
4 1850 5 -0.687 0.298 NaN NaN...
If I set sep = ' ', I get another error:
pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
comment = '%',
header = None,
sep = ' ')
ParserError: Error tokenizing data. C error: Expected 2 fields in line 78, saw 58
Looking up this error, people suggest using header = None (already done) and setting sep explicitly, but that is what causes the problem: Python Pandas Error tokenizing data. I looked up line 78 and can't see any problems. If I set error_bad_lines=False, I get an empty df, suggesting there is a problem with every entry.
Notably this works when I use np.loadtxt():
pd.DataFrame(np.loadtxt('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
comments = '%'))
0 1 2 3 4 5 6 7 8 9 10 11
0 1850.0 1.0 -0.777 0.412 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850.0 2.0 -0.239 0.458 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850.0 3.0 -0.426 0.447 NaN NaN NaN NaN NaN NaN NaN NaN
3 1850.0 4.0 -0.680 0.367 NaN NaN NaN NaN NaN NaN NaN NaN
4 1850.0 5.0 -0.687 0.298 NaN NaN NaN NaN NaN NaN NaN NaN
This suggests to me that there isn't something wrong with the file, but rather with how I am calling pd.read_table(). I looked through the documentation for np.loadtxt() in the hope of passing the same separator, but its default is just delimiter=None (https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html).
I'd prefer to be able to import this as a pd.DataFrame, setting the names, rather than having to import as a matrix and then convert to pd.DataFrame.
What am I getting wrong?
This one is quite tricky. The key is sep=r'\s+', which treats any run of whitespace as a single delimiter, together with comment='%' to skip the commented header lines; note that usecols has to cover all twelve columns so that it lines up with names. Please try the snippet below:

import pandas as pd

url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
df = pd.read_csv(url,
                 sep=r'\s+',
                 comment='%',
                 usecols=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
                 names=('Year', 'Month', 'M.Anomaly', 'M.Unc.', 'A.Anomaly',
                        'A.Unc.', '5y.Anomaly', '5y.Unc.', '10y.Anomaly', '10y.Unc.',
                        '20y.Anomaly', '20y.Unc.'))
The issue is that the file starts with 77 rows of commented text for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Air Temperatures'.
Two of those rows are headers.
There's a bunch of data, then two more headers, and a new set of data for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Water Temperatures'.
This solution separates the two tables in the file into separate dataframes.
It is not as nice as the other answer, but the data is properly separated into different dataframes.
The headers were a pain; it would probably be easier to manually create a custom header and skip the lines of code for separating the headers from the text.
The important point is separating the air and water data.
import requests
import pandas as pd
import math
# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text
# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]
# specify the data from the ranges in the file
air_header1 = data[74].split() # not used
air_header2 = [v.strip() for v in data[75].split(',')]
# combine the 2 parts of the header into a single header
air_header = air_header2[:2] + [f'{air_header1[math.floor(i/2)]}_{v}' for i, v in enumerate(air_header2[2:])]
air_data = [v.split() for v in data[77:2125]]
h2o_header1 = data[2129].split() # not used
h2o_header2 = [v.strip() for v in data[2130].split(',')]
# combine the 2 parts of the header into a single header
h2o_header = h2o_header2[:2] + [f'{h2o_header1[math.floor(i/2)]}_{v}' for i, v in enumerate(h2o_header2[2:])]
h2o_data = [v.split() for v in data[2132:4180]]
# create the dataframes
air = pd.DataFrame(air_data, columns=air_header)
h2o = pd.DataFrame(h2o_data, columns=h2o_header)
Without the header code
Simplify the code by using a manually created header list.
import pandas as pd
import requests
# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text
# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]
# manually created header
headers = ['Year', 'Month', 'Monthly_Anomaly', 'Monthly_Unc.',
'Annual_Anomaly', 'Annual_Unc.',
'Five-year_Anomaly', 'Five-year_Unc.',
'Ten-year_Anomaly', 'Ten-year_Unc.',
'Twenty-year_Anomaly', 'Twenty-year_Unc.']
# separate the air and h2o data
air_data = [v.split() for v in data[77:2125]]
h2o_data = [v.split() for v in data[2132:4180]]
# create the dataframes
air = pd.DataFrame(air_data, columns=headers)
h2o = pd.DataFrame(h2o_data, columns=headers)
air
Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0 1850 1 -0.777 0.412 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850 2 -0.239 0.458 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850 3 -0.426 0.447 NaN NaN NaN NaN NaN NaN NaN NaN
h2o
Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0 1850 1 -0.724 0.370 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850 2 -0.221 0.430 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850 3 -0.443 0.419 NaN NaN NaN NaN NaN NaN NaN NaN
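One caveat for both versions above: because the rows come from str.split(), every column in air and h2o holds strings (object dtype). Converting them to numbers makes plotting and arithmetic behave as expected; a minimal sketch:

# Convert all columns from strings to numbers;
# the literal 'NaN' tokens become real NaN values via errors='coerce'
air = air.apply(pd.to_numeric, errors='coerce')
h2o = h2o.apply(pd.to_numeric, errors='coerce')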