I have a pandas dataframe (df) that looks like this:
   A   B     C
0  1  10  1234
1  2  20     0
I want to save this dataframe in a fixed-width format. The format I have in mind uses a different width for each column, as follows:
"one character for column A's value, then a comma, then four characters for column B's value, then a comma, then five characters for column C's value"
Or symbolically:
-,----,-----
My dataframe above (df) would look like the following in my desired fixed format:
1,  10, 1234
2,  20,    0
How can I write a command in Python that saves my dataframe into this format?
# Left-pad B and C with spaces to widths 4 and 5 (A is already one character wide).
df['B'] = df['B'].astype(str).str.rjust(4)
df['C'] = df['C'].astype(str).str.rjust(5)
# header=False keeps the output to just the data rows, matching the desired format.
df.to_csv('path_to_file.csv', index=False, header=False)
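If the exact widths matter, a more direct sketch is to right-justify each value and write the lines yourself (the widths 1, 4, 5 come from the question; the output path is a placeholder):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [10, 20], "C": [1234, 0]})

# Right-justify each value to its column's fixed width, then join with commas.
widths = {"A": 1, "B": 4, "C": 5}
lines = [
    ",".join(str(row[col]).rjust(w) for col, w in widths.items())
    for _, row in df.iterrows()
]
with open("path_to_file.txt", "w") as f:  # placeholder path
    f.write("\n".join(lines) + "\n")
```

This avoids mutating the dataframe's columns just to control the output formatting.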
I have a data set of over 100,000 rows and 10 columns. These 10 columns should hold numeric values, but about 1% of the cells contain alpha or alphanumeric strings.
How do I use a for loop, or any faster method/function, to change all the alpha and alphanumeric cells to the mean of their column (or to any other numeric value)?
e.g. columns a, b, c & d
a b c d
1 2 5 f5
5 e5 9 6
tg 56 8 r5
q2 4 75 g
above dataset is just an example.
I am looking for any solution you may have.
You can use pd.to_numeric (see the pandas documentation for details). This will make the column numeric.
You can add the keyword argument errors='coerce', which will replace unconvertible values, like the ones containing alphanumeric characters, with NaN. You can then replace these NaNs with the mean value of the column using DataFrame.fillna.
pd.to_numeric only works on a Series, so you would normally apply it column by column, but you can also apply it to the entire DataFrame like this:
df = df.apply(pd.to_numeric, errors = "coerce")
Full example:
df = df.apply(pd.to_numeric, errors = "coerce")
df = df.fillna(df.mean())
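A runnable sketch on data shaped like the example in the question (the column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["1", "5", "tg", "q2"],
    "b": ["2", "e5", "56", "4"],
    "c": ["5", "9", "8", "75"],
    "d": ["f5", "6", "r5", "g"],
})

# Coerce non-numeric cells to NaN, then fill each NaN with its column's mean.
df = df.apply(pd.to_numeric, errors="coerce")
df = df.fillna(df.mean())
print(df)
```

Column a, for instance, becomes [1, 5, NaN, NaN], and the two NaNs are filled with the mean of 1 and 5, i.e. 3.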
I have a csv having rows like this:
Year 1,  Year 1,  Year 1,  Year 1
Month 1, Month 2, Month 3, Month 4
I want these first two rows to be merged into one header, like this:
| Year1-Month1 | Year1-Month2 | etc.
I am reading the csv using pandas dataframe.
All the answers on stack overflow combine the two columns but not rows. Please help.
First, convert the first two rows of the data to a MultiIndex:
df = pd.read_csv(file, header=[0, 1])
And then join the level values with -:
df.columns = df.columns.map('-'.join)
Or use f-strings:
df.columns = [f'{a}-{b}' for a, b in df.columns]
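A self-contained sketch; the CSV content here is a stand-in for the file in the question:

```python
import io
import pandas as pd

csv_data = (
    "Year 1,Year 1,Year 1,Year 1\n"
    "Month 1,Month 2,Month 3,Month 4\n"
    "10,20,30,40\n"
)

# header=[0, 1] turns the first two rows into a two-level MultiIndex of column labels.
df = pd.read_csv(io.StringIO(csv_data), header=[0, 1])

# Flatten each (top, bottom) tuple into a single 'top-bottom' label.
df.columns = df.columns.map('-'.join)
print(df.columns.tolist())
```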
I imported a dataset that looks like this.
Peak, Trough
0 1857-06-01, 1858-12-01
1 1860-10-01, 1861-06-01
2 1865-04-01, 1867-12-01
3 1869-06-01, 1870-12-01
4 1873-10-01, 1879-03-01
5 1882-03-01, 1885-05-01
6 1887-03-01, 1888-04-01
it is a CSV file. But when I check the .shape, it is
(7, 1)
I thought a CSV file would automatically be separated at its commas, but this one doesn't work.
I want to split this column into two, separated at the comma, and split the column names as well. How can I do that?
Use the sep parameter of read_csv.
It's like:
df = pd.read_csv(path, sep=', ', engine='python')
(A multi-character separator requires the Python parser engine.)
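A quick sketch with inline data standing in for the file:

```python
import io
import pandas as pd

data = "Peak, Trough\n1857-06-01, 1858-12-01\n1860-10-01, 1861-06-01\n"

# sep=', ' splits on comma-plus-space; engine='python' supports multi-char separators.
df = pd.read_csv(io.StringIO(data), sep=', ', engine='python')
print(df.shape)  # (2, 2)
```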
Save the data to a text file or CSV, then use read_csv with the parameter skipinitialspace=True, plus parse_dates to convert the values to datetimes:
df = pd.read_csv('data.txt', skipinitialspace=True, parse_dates=[0,1])
print (df.head())
Peak Trough
0 1857-06-01 1858-12-01
1 1860-10-01 1861-06-01
2 1865-04-01 1867-12-01
3 1869-06-01 1870-12-01
4 1873-10-01 1879-03-01
print (df.dtypes)
Peak datetime64[ns]
Trough datetime64[ns]
dtype: object
If the data are in Excel, in one column, it is possible to use Series.str.split on the first column, convert the result to datetimes, and finally set the new column names:
df = pd.read_excel('data.xlsx')
df1 = df.iloc[:, 0].str.split(', ', expand=True).apply(pd.to_datetime)
df1.columns = df.columns[0].split(', ')
print (df1.head())
Peak Trough
0 1857-06-01 1858-12-01
1 1860-10-01 1861-06-01
2 1865-04-01 1867-12-01
3 1869-06-01 1870-12-01
4 1873-10-01 1879-03-01
print (df1.dtypes)
Peak datetime64[ns]
Trough datetime64[ns]
dtype: object
I have a dataframe with a variable number of columns, which are handled inside a MultiIndex for the columns. I'm trying to add several columns into the same MultiIndex structure.
I've tried to add the new columns the way I would if there were only one column level, but it doesn't work.
I have tried this:
df = pd.DataFrame(np.random.rand(4,2), columns=pd.MultiIndex.from_tuples([('plus_zero', 'A'), ('plus_zero', 'B')]))
df['plus_one'] = df['plus_zero'] + 1
But I get ValueError: Wrong number of items passed 2, placement implies 1.
The original df should look like this:
plus_zero
A B
0 0.602891 0.701130
1 0.395749 0.960206
2 0.268238 0.140606
3 0.165802 0.971707
And the result I want:
plus_zero plus_one
A B A B
0 0.602891 0.701130 1.602891 1.701130
1 0.395749 0.960206 1.395749 1.960206
2 0.268238 0.140606 1.268238 1.140606
3 0.165802 0.971707 1.165802 1.971707
Using pd.concat:
You must specify the names of the new top-level columns via keys, and pass axis=1 (or axis='columns'):
pd.concat([df.loc[:,'plus_zero'],df.loc[:,'plus_zero']+1],
keys=['plus_zero','plus_one'],
axis=1)
plus_zero plus_one
A B A B
0 0.049735 0.013907 1.049735 1.013907
1 0.782054 0.449790 1.782054 1.449790
2 0.148571 0.172844 1.148571 1.172844
3 0.875560 0.393258 1.875560 1.393258
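Put together as a runnable sketch (the random values will differ from those shown above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.rand(4, 2),
    columns=pd.MultiIndex.from_tuples([('plus_zero', 'A'), ('plus_zero', 'B')]),
)

# Concatenate the original block and the shifted block side by side;
# `keys` supplies the new top-level column labels.
out = pd.concat(
    [df.loc[:, 'plus_zero'], df.loc[:, 'plus_zero'] + 1],
    keys=['plus_zero', 'plus_one'],
    axis=1,
)
print(out)
```

The result has a two-level column index with 'plus_zero' and 'plus_one' at the top level, each holding sub-columns A and B.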
I have a pandas dataframe with an integer column called TDLINX. I'm trying to convert it to a string padded with leading zeros so that every value is 7 characters long. So 7 would become "0000007".
This is the code that I used:
df_merged_total['TDLINX2'] = df.TDLINX.apply(lambda x: str(x).zfill(7))
At first glance this appeared to work, but as I went further down the file, I realized that the value in TDLINX2 was starting to get shifted. What could be causing this and what can I do to prevent it?
You could do something like this:
>>> df = pd.DataFrame({"col":[1, 33, 555, 7777]})
>>> df["new_col"] = ["%07d" % x for x in df.col]
>>> df
col new_col
0 1 0000001
1 33 0000033
2 555 0000555
3 7777 0007777
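A vectorised alternative is pandas' own string accessor. One plausible cause of the "shifting" in the question (an assumption, since the data isn't shown): if the column contains missing values it is stored as float, and str(x) then yields e.g. "7.0", which zfill pads to "00007.0". Casting to int first avoids that:

```python
import pandas as pd

df = pd.DataFrame({"TDLINX": [1, 33, 555, 7777]})

# Cast to int (drops any float representation), then to str, then left-pad to 7 chars.
df["TDLINX2"] = df["TDLINX"].astype(int).astype(str).str.zfill(7)
print(df["TDLINX2"].tolist())
```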