Can someone explain what is wrong with this pandas concat code, and why the data frame remains empty? I am using the Anaconda distribution, and as far as I remember it was working before.
You want to use this form:
result = pd.concat([dataframe, series], axis=1)
pd.concat(...) does not operate "in place" on the original dataframe; it returns the concatenated result, so you need to assign that result somewhere, e.g.:
>>> import pandas as pd
>>> s = pd.Series([1,2,3])
>>> df = pd.DataFrame()
>>> df = pd.concat([df, s], axis=1) # We assign the result back into df
>>> df
   0
0  1
1  2
2  3
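As a side note (not in the original answer, just a sketch): if you give the Series a name, the concatenated column is labelled with that name instead of the default 0.
>>> s = pd.Series([1, 2, 3], name="values")  # "values" is a hypothetical name
>>> df = pd.concat([pd.DataFrame(), s], axis=1)
>>> df
   values
0       1
1       2
2       3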
Related
With a dataframe whose index has two levels, either empty or already filled with something:
import pandas as pd
midx = pd.MultiIndex(levels=[[], []],
                     codes=[[], []],
                     names=[u'var_name', u'modalities'])
df = pd.DataFrame(index=midx)
df.loc[("foo","bar"),"A"] = 3
df
Returns:
                       A
var_name modalities
foo      bar         3.0
I want to assign the values of a series:
s = pd.Series([1,2,3], index=["a","b","c"])
Such that the result is:
                       A
var_name modalities
foo      bar         3.0
baz      a           1.0
         b           2.0
         c           3.0
How could I get that with loc or another solution?
df.loc[("baz", s.index), "A"] = s does not work.
I could not find a way to do it with loc.
My current solution is to use the concat method, which feels a bit unnatural to me. (The first answer with a working solution will get the checkmark.)
Solution with concat
import pandas as pd
midx = pd.MultiIndex(levels=[[], []],
                     codes=[[], []],
                     names=[u'var_name', u'modalities'])
df = pd.DataFrame(index=midx)
df.loc[("foo","bar"),"A"] = 3
s = pd.Series([1,2,3], index=["a","b","c"], name="A")
s_midx = pd.concat([s], keys=["baz"])
df_s_midx = pd.DataFrame(s_midx)  # Is there a way to avoid this?
pd.concat([df, df_s_midx])
There is probably a way to avoid converting the series s_midx to a dataframe, but I did not find it!
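For what it's worth, a sketch of a shorter route: Series.to_frame() does the Series-to-DataFrame conversion in one step, so the explicit pd.DataFrame(...) call can be dropped.
s = pd.Series([1, 2, 3], index=["a", "b", "c"], name="A")
s_midx = pd.concat([s], keys=["baz"])        # prepend "baz" as the outer index level
result = pd.concat([df, s_midx.to_frame()])  # to_frame() replaces pd.DataFrame(s_midx)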
I'm dynamically constructing a df with pandas.
I would like newly added elements to default to a specific type or value instead of NaN. Is this possible?
like:
import pandas as pd
df = pd.DataFrame()
df.at[1,["a","b","c"]] = "a"
df.at[2,["a","c"]] = "b"
print(df)
you get:
   a    b  c
1  a    a  a
2  b  NaN  b
Here df.at[2,"b"] is set by pandas to NaN by default, but I wish it could default to an empty string ("").
I don't want to use pd.isna() or replace() to check and assign a value on each iteration while I dynamically build this df.
Is there a way to make the DataFrame default to the string type when it is initialized?
like:
df = pd.DataFrame(dtype=str)
(which I tried, and it did not seem to work)
---update---
the full code is something like:
df = pd.DataFrame()
for ...:  # some loop that builds the frame dynamically
    df.at[1, ["a", "b"]] = "a"
    df.at[2, ["a"]] = "b"
    x = df.at[2, "b"] + "hi"  # fails: NaN (float) + str
The line computing x raises an error (float type + string type) if I don't check for NaN with some if/else beforehand.
In this case, I think df.fillna("") is much better than isna()/replace(), but it is still a little bit limited.
Thanks again :)
How about using df.fillna("") after the dataframe is created? This way you fill the NaN values with a specified value.
import pandas as pd
import numpy as np  # needed for np.nan below

df = pd.DataFrame()
df['a'] = ["a", "b", "c"]
df['b'] = ["a", "c", np.nan]
df = df.fillna("")
Then you get this:
   a  b
0  a  a
1  b  c
2  c
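Tying this back to the update in the question (a sketch reusing its column names): a single fillna("") pass after construction makes the string concatenation safe without per-iteration isna() checks.
import pandas as pd

df = pd.DataFrame()
df.at[1, "a"] = "a"
df.at[1, "b"] = "a"
df.at[2, "a"] = "b"
df = df.fillna("")        # one pass instead of checks inside the loop
x = df.at[2, "b"] + "hi"  # "" + "hi" == "hi"; no float + str TypeError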
I have a database with a sample as below.
The data frame is generated when I load the data in Python with the code below:
import os
import pandas as pd
data_dir="D:\\userdata\\adbharga\\Desktop\\AVA\\PythonCoding\\VF-Aus\\4G Cell Graphs"
os.chdir(data_dir)
df = pd.read_csv('CA Throughput(Kbit_s) .csv',index_col=None, header=0)
Output:
Is there any way we can avoid reading duplicate columns in pandas, or remove the duplicate columns after reading?
Please note: the column names differ once the data is read into pandas (duplicates get a suffix such as .1), so a command like df = df.loc[:, ~df.columns.duplicated()] won't work.
The actual database is very big and has many duplicate columns containing dates only.
There are 2 ways you can do this.
Ignore columns when reading the data
pandas.read_csv has the argument usecols, which accepts an integer list.
So you can try:
# work out required columns
df = pd.read_csv('file.csv', header=0)
cols = [0] + list(range(1, len(df.columns), 2))
# use column integer list
df = pd.read_csv('file.csv', usecols=cols)
Remove columns from dataframe
You can use similar logic with pd.DataFrame.iloc to remove unwanted columns.
# cols as defined in previous example
df = df.iloc[:, cols]
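For instance, applied to a small inline sample with alternating duplicate Date columns (a sketch; StringIO stands in for the real file):
import pandas as pd
from io import StringIO

data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1'''

df = pd.read_csv(StringIO(data))
cols = [0] + list(range(1, len(df.columns), 2))  # first Date, then every other column
df = df.iloc[:, cols]
print(df)
#          Date  Value1  Value2
# 0  2018-01-01       0       1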
One way to do it is to read only the first row and create a mask using drop_duplicates(). We pass this to usecols without needing to specify the indices beforehand. It should be fail-safe.
from io import StringIO  # pd.compat.StringIO was removed in pandas 1.0

m = pd.read_csv(StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(StringIO(data), usecols=m)
Full example:
import pandas as pd
from io import StringIO

data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''

# Read only the header row, transpose, keep the first occurrence of each name
m = pd.read_csv(StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(StringIO(data), usecols=m)
print(df)
# Date Value1 Value2
#0 2018-01-01 0 1
#1 2018-01-02 0 1
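For clarity (my own check, not part of the original answer): the mask m is simply the integer positions of the first occurrence of each header name.
print(list(m))
# [0, 1, 3] — the duplicate Date at position 2 is dropped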
Another way to do it would be to remove all columns with a dot (.) inside, since pandas renames duplicate columns by appending suffixes such as .1. This should work in most cases, as the dot is rarely used in column names:
df = df.loc[:,~df.columns.str.contains('.', regex=False)]
Full example:
import pandas as pd
from io import StringIO

data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''

df = pd.read_csv(StringIO(data))
df = df.loc[:, ~df.columns.str.contains('.', regex=False)]
print(df)
# Date Value1 Value2
#0 2018-01-01 0 1
#1 2018-01-02 0 1
When you define a dataframe in pandas in the following manner
df = pd.DataFrame([['07-Dec-2015', 1, 2],
                   ['08-Dec-2015', 3, 4],
                   ['09-Dec-2015', 5, 6]],
                  columns=['Date', 'FR', 'UK'])
df.set_index('Date')
Out[1]:
             FR  UK
Date
07-Dec-2015   1   2
08-Dec-2015   3   4
09-Dec-2015   5   6
is there a way to assign a label to the columns (let's say 'Country') and another label for the dataframe values (let's say 'Hits')? I would like to make it look like this:
As a side note: the dataframe in the attached image above has been created as follows:
df = pd.DataFrame()
df['Date'] = ['07-Dec-2015','07-Dec-2015','08-Dec-2015','08-Dec-2015','09-Dec-2015','09-Dec-2015']
df['Country'] = ['UK','FR','UK','FR','UK','FR']
df['Hits'] = [2,1,4,3,6,5]
df = df.set_index(['Date','Country'])
df.unstack()
However, this is not good enough for my purpose, because in my Python application the dataframe constructor gets passed a numpy array, and the index argument a datetime vector; broadly speaking it looks like pd.DataFrame(numpy.ndarray, columns=columnNames, index=DatetimeIndex).
Thanks in advance
You could:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((10, 2)),
                  index=pd.date_range(start='2015-01-01', periods=10, freq='D'))
df.index.name = 'Date'
df.columns = pd.MultiIndex.from_product([['Hits'], ['UK', 'FR']],
                                        names=['', 'Country'])
See MultiIndex docs.
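Once the MultiIndex columns are in place, selection works on either level; a quick sketch:
uk_hits = df[('Hits', 'UK')]                    # one column, as a Series
fr_hits = df.xs('FR', axis=1, level='Country')  # cross-section on the Country level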
I'm using the pandas library for remote sensing time series analysis. Eventually I would like to save my DataFrame to CSV using chunk sizes, but I run into a little issue. My code generates 6 NumPy arrays that I convert to pandas Series. Each of these Series contains a lot of items:
>>> prcpSeries.shape
(12626172,)
I would like to add the Series into a Pandas DataFrame (df) so I can save them chunk by chunk to a csv file.
d = {'prcp': pd.Series(prcpSeries),
     'tmax': pd.Series(tmaxSeries),
     'tmin': pd.Series(tminSeries),
     'ndvi': pd.Series(ndviSeries),
     'lstm': pd.Series(lstmSeries),
     'evtm': pd.Series(evtmSeries)}

df = pd.DataFrame(d)
outFile = 'F:/data/output/run1/_' + str(i) + '.out'
df.to_csv(outFile, header=False, chunksize=1000)
d = None
df = None
But my code gets stuck at the following line, giving a MemoryError:
df = pd.DataFrame(d)
Any suggestions? Is it possible to fill the Pandas DataFrame chunk by chunk?
If you know each of these is the same length, then you can create the DataFrame directly from one array and then append each remaining column:
df = pd.DataFrame(prcpSeries, columns=['prcp'])
df['tmax'] = tmaxSeries
...
Note: you can also use the to_frame method (which optionally takes a name, useful if the Series doesn't have one):
df = prcpSeries.to_frame(name='prcp')
However, if they are variable length then this will lose some data (any arrays which are longer than prcpSeries). An alternative here is to create each as a DataFrame and then perform an outer join (using concat):
df1 = pd.DataFrame(prcpSeries, columns=['prcp'])
df2 = pd.DataFrame(tmaxSeries, columns=['tmax'])
...
df = pd.concat([df1, df2, ...], join='outer', axis=1)
For example:
In [21]: dfA = pd.DataFrame([1,2], columns=['A'])
In [22]: dfB = pd.DataFrame([1], columns=['B'])
In [23]: pd.concat([dfA, dfB], join='outer', axis=1)
Out[23]:
   A   B
0  1   1
1  2 NaN
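A closing sketch combining both branches of this answer, using the question's six arrays (the equal-length check is my own addition):
import pandas as pd

arrays = {'prcp': prcpSeries, 'tmax': tmaxSeries, 'tmin': tminSeries,
          'ndvi': ndviSeries, 'lstm': lstmSeries, 'evtm': evtmSeries}
if len({len(a) for a in arrays.values()}) == 1:
    # Equal lengths: start from one column and attach the rest.
    df = pd.DataFrame(arrays['prcp'], columns=['prcp'])
    for name in ('tmax', 'tmin', 'ndvi', 'lstm', 'evtm'):
        df[name] = arrays[name]
else:
    # Variable lengths: the outer join pads shorter columns with NaN.
    frames = [pd.DataFrame(a, columns=[name]) for name, a in arrays.items()]
    df = pd.concat(frames, join='outer', axis=1)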