reading multiple files with glob duplicates columns - python

I'm trying to read many txt files into a data frame, and the code below works. However, it duplicates some of my columns, though not all of them. I couldn't find a solution. What can I do to prevent this?
import functools  # needed for functools.partial below
import glob
import pandas as pd
dfs = pd.DataFrame(pd.concat(map(functools.partial(pd.read_csv, sep='\t', low_memory=False),
                                 glob.glob(r'/folder/*.txt')), sort=False))
Let's say my data should look like this:
[image not included]
But it looks like this:
[image not included]
I don't want my columns to be duplicated.

Could you give us a bit more information? The output of dfs.columns would be especially useful. I suspect there could be some extra spaces in your column names, which would cause pandas to treat them as different columns.
You could also try dask for this:
import dask.dataframe as dd
dfs = dd.read_csv(r'/folder/*.txt', sep='\t').compute()
This is a bit simpler and should give the same result.
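The suspicion about stray spaces in column names can be checked on a small scale. The following is a minimal sketch, assuming the duplication really does come from whitespace in the headers; the two in-memory "files" and their contents are made up for illustration.

```python
import io
import pandas as pd

# Two tab-separated "files" (hypothetical data); the second one has a
# trailing space in the "value" header.
file_a = io.StringIO("id\tvalue\n1\t10\n2\t20\n")
file_b = io.StringIO("id\tvalue \n3\t30\n4\t40\n")

frames = [pd.read_csv(f, sep="\t") for f in (file_a, file_b)]

# Without cleaning, "value" and "value " are treated as different columns.
raw = pd.concat(frames, sort=False)
print(list(raw.columns))

# Strip surrounding whitespace from every header before concatenating.
for frame in frames:
    frame.columns = frame.columns.str.strip()

clean = pd.concat(frames, ignore_index=True, sort=False)
print(list(clean.columns))
```

If the real files show the same pattern in dfs.columns, stripping the headers right after reading each file should be enough.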

It helps to think of concat as having two possible outcomes, depending on the axis you choose: the inputs are added as new columns, as in Example I below, or as new rows, as in Example II. pd.concat lets you pick by setting axis to either 0 (rows) or 1 (columns).
Read more in the excellent documentation: concat
Example I:
import pandas as pd
import glob
pd.concat([pd.read_csv(f) for f in glob.glob(r'/folder/*.txt')], axis=1)
Example II:
pd.concat([pd.read_csv(f) for f in glob.glob(r'/folder/*.txt')], axis=0)
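To make the difference between the two axes concrete, here is a small sketch with two made-up in-memory frames instead of files. Note that axis=1 keeps both copies of identically named columns, which may be one way duplicated columns arise.

```python
import pandas as pd

# Two small frames with the same column names (hypothetical data).
a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [5, 6], "y": [7, 8]})

# axis=0 stacks the rows: 4 rows, 2 columns.
stacked = pd.concat([a, b], axis=0, ignore_index=True)

# axis=1 places the frames side by side: 2 rows, 4 columns,
# with the column names "x" and "y" each appearing twice.
side_by_side = pd.concat([a, b], axis=1)

print(stacked.shape)
print(side_by_side.shape)
print(list(side_by_side.columns))
```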

Related

Importing multiple excel files with similar name, pivoting each excel file and then appending the results into a single file

My problem statement is as above. Below is my progress so far
1. I want to extract multiple excel files from the same location, namely Test1, Test2, Test3... (I am using glob to do this) (DONE)
2. I want to iterate through the folder and find files starting with a string (DONE)
3. I then formed an empty dataframe. I want to pivot the first file's dataframe based on the date (as columns), go to the next file and do the same, and then append my results to a dataframe.
My problem right now is that I am appending all results to the pivot that I created using my first file.
Can someone please help.
import pandas as pd
import numpy as np
import glob
glob.glob("C:/Users/Tom/Desktop/DC")
all_data = pd.DataFrame()
for f in glob.glob("C:/Users/Tom/Desktop/DC/Test?.xlsx"):
    df = pd.read_excel(f)
    pivot = pd.pivot_table(df, index='DC Desc', columns='Est Wk End Date', values=['Shipped/Ordered Units'], aggfunc='sum')
    all_data = all_data.append(pivot, ignore_index=True)
all_data.to_excel("outputappended2.xlsx")
Edit.
Thanks so much for your response. This helps a lot. Can you also tell me how, before concatenating the next pivot, I can add a new line so that I can differentiate between the results, and also how to sort by date?
Eg. I am getting the following result
DC Desc Apr 24,21 Dec 1,2020 Feb 6,2021
a 5000
b 2000 4000
c 1000
and I am looking for
DC Desc Dec 1,2020 Apr 24,21 Feb 6,2021
a 5000
b 2000 4000
c 1000
This way I can tell which information comes from which file and also sort the columns. Any help is appreciated.
Your best alternative is probably pd.concat. A simple approach I like is to write a processing function and then concatenate all the resulting dataframes. Something like this:
import pandas as pd
import glob
def pivot_your_data(f):
    df = pd.read_excel(f)
    return pd.pivot_table(df, index='DC Desc', columns='Est Wk End Date', values='Shipped/Ordered Units', aggfunc='sum')
all_data = pd.concat([pivot_your_data(f) for f in glob.glob("C:/Users/Tom/Desktop/DC/Test*.xlsx")])
Then you could drop index or do more data processing, but the main point is to use pd.concat
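For the follow-up about telling the files apart and ordering the date columns, one possible sketch is below. It assumes the date columns are strings that pd.to_datetime can parse; the two pivots, the "Test1"/"Test2" labels, and the "file" level name are hypothetical stand-ins for the real Excel results. If the columns are already datetimes, sorting the column index directly would likely be enough.

```python
import pandas as pd

# Hypothetical pivot results standing in for two Excel files.
pivot_1 = pd.DataFrame(
    {"Apr 24, 2021": [5000.0, None], "Dec 1, 2020": [None, 2000.0]},
    index=pd.Index(["a", "b"], name="DC Desc"),
)
pivot_2 = pd.DataFrame(
    {"Feb 6, 2021": [4000.0, 1000.0]},
    index=pd.Index(["b", "c"], name="DC Desc"),
)

# keys= adds an outer index level, so each block of rows is labelled
# with the file it came from.
all_data = pd.concat([pivot_1, pivot_2], keys=["Test1", "Test2"], names=["file"])

# Order the columns chronologically by parsing each label as a date.
ordered = sorted(all_data.columns, key=pd.to_datetime)
all_data = all_data[ordered]
print(all_data)
```

The outer index level plays the role of the "new line" between results: each file's rows stay grouped under its own label.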

Merge several .csv into one csv in python

Good evening,
So I have a huge number of .csv files which I either want to combine into one giant csv before reading it with pandas, or read directly into a single df. The .csv files all have two columns, "timestamp" and "holdings". Now I want to merge them on the "timestamp" column where the timestamps match and create a new column for each "holdings" column. So far I have produced this:
import os
import glob
import pandas as pd
os.chdir("C/USer....")
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
dfs = [pd.read_csv(f, index_col=[0], parse_dates=[0])
for f in os.listdir(os.getcwd()) if f.endswith('csv')]
The output is a list of dfs. How do I merge them on the "timestamp" column now? I already tried concat and merge, but it always puts them in a single column.
What you are looking for is an outer join between the dataframes. Since the pandas merge function only operates on two dataframes at a time, we need to merge them pairwise, one after another. We can use the reduce function from functools to do this cleanly in one line:
import pandas as pd
from functools import reduce
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['timestamp'],
how='outer'), dfs)
Use the suffixes argument in the merge function to clean up your column headings.
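Because every file uses the same "holdings" column name, an alternative to suffixes is to rename that column per file before merging. This is a minimal sketch under that assumption; the frame names ("fund_a" and so on) and the data are hypothetical, and here "timestamp" is an ordinary column rather than the index.

```python
from functools import reduce
import pandas as pd

# Hypothetical stand-ins for the CSV files, keyed by a made-up name.
frames = {
    "fund_a": pd.DataFrame({"timestamp": ["2021-01-01", "2021-01-02"], "holdings": [10, 11]}),
    "fund_b": pd.DataFrame({"timestamp": ["2021-01-02", "2021-01-03"], "holdings": [20, 21]}),
    "fund_c": pd.DataFrame({"timestamp": ["2021-01-01", "2021-01-03"], "holdings": [30, 31]}),
}

# Give each "holdings" column a unique name so no suffixes are needed.
renamed = [
    df.rename(columns={"holdings": f"holdings_{name}"})
    for name, df in frames.items()
]

# Outer-join the frames pairwise on "timestamp", then sort by it.
merged = reduce(
    lambda left, right: pd.merge(left, right, on="timestamp", how="outer"),
    renamed,
).sort_values("timestamp", ignore_index=True)
print(merged)
```

Timestamps missing from a given file show up as NaN in that file's column.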

Sorting big csv file with pandas-groupby and use function .mean()

I have a big csv file with 3 columns and lots of lines.
It looks something like this: [image not included]
Now I would like to have all the lines with ID1 grouped and get the mean of their values in C.
My code for this looks like that:
import pandas as pd
import numpy as np
df = pd.read_csv('file.csv', sep=";",decimal=",", index_col=0)
grouped = df.groupby(['A'])[['C']]
grouped.mean()
When running the code I get this error:
DataError: No numeric types to aggregate
But in the csv file I made sure that there are no NaN and no non-numerical values.
What can I do about this? Many thanks!
The error message indicates that the column you are aggregating has a non-numeric data type, so aggregating functions cannot use it. Use
df.dtypes
to inspect your data types. If column C is not int/float, you have to convert it:
df['C'] = df['C'].astype(float)
and perform the groupby afterwards.
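If astype(float) itself fails, some values in the column are probably not parseable as numbers. A hedged sketch of how to locate them with pd.to_numeric follows; the data is made up, and the "n/a" entry is only an assumed example of a problem value.

```python
import pandas as pd

# Made-up data: column C was read as strings, and one value is not a number.
df = pd.DataFrame({
    "A": ["ID1", "ID1", "ID2", "ID2"],
    "C": ["1.5", "2.5", "n/a", "4.0"],
})

# errors="coerce" turns unparseable entries into NaN instead of raising.
numeric_c = pd.to_numeric(df["C"], errors="coerce")

# Rows that failed to convert, useful for finding the offending values.
bad_rows = df[numeric_c.isna() & df["C"].notna()]
print(bad_rows)

# With a numeric column, the groupby mean works; NaN values are skipped.
df["C"] = numeric_c
means = df.groupby("A")["C"].mean()
print(means)
```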

Merging two excel files using python with mismatching sizes

I have been trying to merge those two excel files.
Those files are already ready to be joined just as you can see in my image example.
I have tried the solutions from the answer here using pandas and xlwt, but I still can not save both in one file.
Desired result is:
P.S.: the two data frames may have mismatched columns and rows, which should just be ignored. I am looking for a way to paste one into the other using pandas.
How can I approach this problem? Thank you in advance.
import pandas as pd
import numpy as np
df = pd.read_excel('main.xlsx')
df.index = np.arange(1, len(df) + 1)
df1 = pd.read_excel('alt.xlsx', header=None, names=list(df))
for i in list(df):
    if any(pd.isnull(df[i])):
        df[i] = df1[i]
print(df)
df.to_excel("<filename>.xlsx", index=False)
Try this. The main.xlsx is your first excel file while the alt.xlsx is the second one.
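A different technique, not the one used in the loop above, is to fill only the missing cells with DataFrame.fillna, passing the second frame as the fill value. The sketch below uses made-up in-memory frames in place of main.xlsx and alt.xlsx, and it assumes both frames share row labels and column names wherever they overlap; rows and columns that exist only in the second frame are ignored.

```python
import pandas as pd

# Hypothetical stand-ins for the two Excel files.
main = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, None, 30.0]})
alt = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "value": [11.0, 22.0, 33.0, 44.0],
    "extra": ["a", "b", "c", "d"],
})

# fillna aligns on row labels and column names: only the gaps in `main`
# are filled, and the extra row and column of `alt` are left out.
merged = main.fillna(alt)
print(merged)
```

Unlike replacing a whole column, this keeps the values that main already has and touches only the empty cells.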

Plot diagram in Pandas from CSV without headers

I am new to plotting charts in python. I've been told to use Pandas for that, using the following command. Right now it is assumed the csv file has headers (time,speed, etc). But how can I change it to when the csv file doesn't have headers? (data starts from row 0)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.read_csv("P1541350772737.csv")
#df.head(5)
df.plot(figsize=(15,5), kind='line',x='timestamp', y='speed') # line plot
You can specify x and y by column position; you don't need the column names for that:
df.plot(figsize=(15,5), kind='line', x=0, y=1)
This works if the x column is first and the y column is second; columns are numbered from 0.
For example:
The same result with the names of the columns instead of positions:
I may have misinterpreted your question, but I'll do my best.
The problem seems to be that you have to read a csv that has no header, but you want to add one. I would use this code:
cols=['time', 'speed', 'something', 'else']
df = pd.read_csv('useful_data.csv', names=cols, header=None)
For your plot, the code you used should be fine with my correction. I would also suggest looking at matplotlib for your graph.
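The two ways of reading a headerless file can be compared on a tiny in-memory example. This is a sketch with made-up data; the column names "time" and "speed" are assumptions chosen for illustration.

```python
import io
import pandas as pd

# A small headerless CSV (hypothetical values).
raw = "0,12.5\n1,13.0\n2,12.8\n"

# header=None alone: columns are labelled with the integers 0, 1, ...
by_position = pd.read_csv(io.StringIO(raw), header=None)
print(list(by_position.columns))

# header=None plus names=: columns get the labels you supply.
named = pd.read_csv(io.StringIO(raw), header=None, names=["time", "speed"])
print(list(named.columns))

# Either frame can then be plotted, for example by_position.plot(x=0, y=1)
# or named.plot(x="time", y="speed"); plotting requires matplotlib.
```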
You can try
df = pd.read_csv("P1541350772737.csv", header=None)
With the names kwarg you can set arbitrary column headers; passing it silently implies header=None, i.e. data is read from row 0.
You might also want to check the doc https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
Pandas is more focused on data structures and data analysis tools, but it supports plotting by using Matplotlib as a backend. If you're interested in building different types of plots in Python, you might want to check Matplotlib out.
Back to pandas: it assumes that the first row of your csv is a header. If your file doesn't have a header, you can pass header=None as a parameter, as in pd.read_csv("P1541350772737.csv", header=None). The columns are then labelled 0, 1, 2, and so on, so refer to them by those labels when plotting instead of by 'timestamp' and 'speed'.
The full list of parameters that you can pass to pandas for reading a csv can be found in the pandas read_csv documentation; you'll find a lot of useful options there (such as skipping rows, defining the index column, etc.).
Happy coding!
For most commands you will find help in the respective documentation. Looking at pandas.read_csv you'll find an argument names
names : array-like, default None
List of column names to use. If file contains no header row, then you should explicitly
pass header=None.
So you will want to give your columns names by which they appear in the dataframe.
As an example: Suppose you have this data file
1, 2
3, 4
5, 6
Then you can do
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("data.txt", names=["A", "B"], header=None)
print(df)
df.plot(x="A", y="B")
plt.show()
which outputs
A B
0 1 2
1 3 4
2 5 6
